Stock Price Prediction by Deep Neural Generative Model of News Articles

SUMMARY In this study, we propose a deep neural generative model for predicting daily stock price movements given news articles. Approaches involving conventional technical analysis have been investigated to identify certain patterns in past price movements, which in turn helps to predict future price movements. However, the financial market is highly sensitive to specific events, including corporate buyouts, product releases, and the like. Therefore, recent research has focused on modeling relationships between these events that appear in the news articles and future price movements; however, a very large number of news articles are published daily, each article containing rich information, which results in overfitting to past price movements used for parameter adjustment. Given the above, we propose a model based on a generative model of news articles that includes price movement as a condition, thereby avoiding excessive overfitting thanks to the nature of the generative model. We evaluate our proposed model using historical price movements of Nikkei 225 and Standard & Poor's 500 Stock Index, confirming that our model predicts future price movements better than such conventional classifiers as support vector machines and multilayer perceptrons. Further, our proposed model extracts significant words from news articles that are directly related to future stock price movements.


Introduction
Financial markets occupy an important position in modern society; as such, arguably no one is independent of the financial climate. Successfully predicting price movements of financial commodities (i.e., stocks, debt obligations, and derivatives) can potentially avoid the harmful effects that a pending financial crisis could have on daily life, as well as provide economic earnings. Many mathematical models of price movements have been proposed in the past (e.g., [1]). Research has also focused on technical analysis methods for identifying certain patterns in past price movements to predict future price movements [2]-[6]; this is typically founded on the belief that the same price movement patterns are often repeated. The financial market, however, is not a closed system and is instead very sensitive to various events, including corporate buyouts, new product releases, and the like. Even with their success in predicting long-term trends, pattern-based approaches are limited in terms of predicting event-related and short-term price movements. The financial market is considered to be efficient [7]: investors read all publicly available information, such as news articles, to learn about events both past and upcoming, then react by trading the corresponding financial commodity. The price of the financial commodity thus responds to these events rather quickly. Based on this relationship, machine learning approaches have been directed to predict price movements by employing classifiers given news articles [8]-[14]. In other words, these approaches build a model describing the indirect influence that news articles have on prices; however, a large number of news articles are published daily, and each article contains rich information regardless of whether or not it is related to the financial market.
More specifically, machine learning approaches are given numerous news articles as explanatory variables that explain a limited number of price movements; such approaches are prone to overfitting to the price movements used for parameter adjustment and have not been generalized to accurate prediction of future price movements. In addition, such a model is often a black box, i.e., the model does not provide a reason for its predictions. To counter this problem, several studies have focused on extracting limited features from news articles, such as the number of occurrences of hand-selected words; other studies have averaged features across many news articles [9], [10], [13]. Such feature selection approaches risk corrupting the information contained in the original news articles.
Given the above, in this paper, we propose a deep neural generative model (DGM) of news articles to predict price movements; to our knowledge, this is the first time a DGM has been used to tackle such a problem. The DGM is an implementation of a generative model on deep neural networks [15]-[17]. Our proposed model generates news articles embedded into vectors [18], [19] given an assumed future price movement as a condition. Thanks to the nature of generative modeling, our proposed model is expected to have a lower risk of overfitting to the past price movements used for training.
We evaluate our proposed model using historical datasets of Nikkei 225 (Nikkei Stock Average) and Standard & Poor's 500 Stock Index, as well as related news articles. Our experimental results demonstrate that our proposed model better predicts the movements of these stock indices than two key conventional baseline methods, i.e., support vector machines (SVMs) [3], [4], [9], [10] and multilayer perceptrons (MLPs) [11]-[14]. Results of a simplified market simulation [8], [13] also demonstrate that our proposed model is more capable of making a profit than the baseline methods. Finally, arithmetic operations on the generated vectors that embed artificial news articles confirm that our proposed model can visualize significant words contributing to the prediction, i.e., words determining the sentiments of the given news articles.

Generative Model of News Articles and Stock Prices
In this study, we propose a generative model p_θ(x) of a set of news articles x = {x_i} published in one day for the binary prediction of stock price movements, with output y indicating either a predicted increase (i.e., y = +1) or a predicted decline (i.e., y = −1). On each day, events i = 0, 1, . . . related to the company and society at large occur independently and appear in news articles x = {x_i}. In our model, each news article x_i contains economic information s_i ∈ {+1, −1} and neutral information z_i ∈ R^{n_z}. Economic information s_i is related to the company's performance and directly motivates investment behavior, whereas neutral information z_i is the remaining information that does not directly motivate investment behavior, including the company's name, business type, etc.
Given the above, in this study, we propose the following generative model of news article x_i, which we also present in Fig. 1 (a):

p_θ(x_i) = Σ_{s_i ∈ {+1, −1}} ∫ p_θ(x_i | s_i, z_i) p(s_i) p(z_i) dz_i,    (1)

where p(s_i) and p(z_i) are the prior distributions of economic information s_i and neutral information z_i of each individual news article x_i, respectively. Investors find economic information s = {s_i} by reading the set of news articles x = {x_i} published on one day, then respond by purchasing or short-selling stocks, all of which influences the stock price on the following day. In other words, economic information s influences stock price movement y ∈ {+1, −1} on the following day through investor behavior (as denoted by the solid arrow across the border of the plate in Fig. 1 (a)). Although it is difficult for non-experts to find economic information s_i in news article x_i, we approximate economic information s_i by the stock price movement y on the following day.
As a result, as shown in Fig. 1 (b), we consider a conditional generative model log p_θ(x | y = k) of the set of news articles x given stock price movement y on the following day as:

log p_θ(x | y = k) = Σ_i log p_θ(x_i | y = k).    (2)

Based on the variational method, the conditional log-likelihood log p_θ(x | y = k) can be bounded from below using conditional probability q_φ as:

log p_θ(x | y = k) ≥ Σ_i ( E_{q_φ(z_i | x_i, y = k)}[ log p_θ(x_i | z_i, y = k) ] − D_KL( q_φ(z_i | x_i, y = k) || p(z_i) ) ) = −L_k(x),

where D_KL(·||·) is the Kullback-Leibler divergence and −L_k(x) is the lower bound.

Predicting Stock Prices and the Objective Function
Future price movement y is known in the training dataset, whereas in the testing dataset it is unknown and is to be predicted. We first approximate the log-likelihood log p_θ(x | y = k) of the set of news articles x = {x_i} by its lower bound −L_k(x). Next, using Bayes' theorem, we obtain an approximation of the log-probability log p(y = k | x) that future stock price movement y takes the value k ∈ {+1, −1}:

log p(y = k | x) = log p_θ(x | y = k) + log p(y = k) − log p(x) ≈ −L_k(x) + const.,

where we assume that the prior probability p(y = k) of stock price movement y is 0.5. From this approximation, when we find −L_{+1}(x) > −L_{−1}(x), we consider log p_θ(y = +1 | x) > log p_θ(y = −1 | x) and predict an increase in the stock price on the following day. In short, the summation Σ_i log p_θ(x_i | y = k) of the log-likelihoods of the news articles x_i published on a given day is what matters, since log p_θ(x | y = k) = Σ_i log p_θ(x_i | y = k). From the above, we propose the objective function below, which must be minimized:

J(x, y) = (1 − ω) L_y(x) − ω log ( exp(−L_y(x)) / Σ_k exp(−L_k(x)) ).    (3)

Here, our proposed model can be trained as a conditional generative model or a classifier by adjusting ω: our proposed model is a conditional generative model when ω = 0.0, whereas our proposed model with ω = 1.0 is a classifier. When 0.0 < ω < 1.0, our proposed model works as a classifier with a penalty term derived from the generative model.
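The decision rule and the role of ω can be made concrete with the following minimal sketch, assuming the lower bounds −L_k(x) have already been computed and assuming the particular (1 − ω)/ω weighting of Eq. (3) described above:

```python
import numpy as np

def predict_and_loss(neg_L, true_k, omega=0.9):
    """Given neg_L[k] = -L_k(x), the lower bound on log p(x|y=k) for each
    assumed movement k (index 0 for y=+1, index 1 for y=-1), predict the
    movement and evaluate a sketch of the omega-weighted objective of
    Eq. (3). The exact weighting is an assumption for illustration."""
    neg_L = np.asarray(neg_L, dtype=float)
    # With equal priors p(y=k) = 0.5, log p(y=k|x) ~ -L_k(x) up to a constant,
    # so we predict the assumption with the larger lower bound.
    pred = 0 if neg_L[0] > neg_L[1] else 1
    # Classification term: stable log-softmax over the lower bounds.
    m = neg_L.max()
    log_post = neg_L - m - np.log(np.exp(neg_L - m).sum())
    generative_term = -neg_L[true_k]      # L_y(x): negative lower bound
    classifier_term = -log_post[true_k]   # cross-entropy over assumptions
    loss = (1.0 - omega) * generative_term + omega * classifier_term
    return pred, loss
```

With ω = 0.0 the loss reduces to the generative lower bound alone, and with ω = 1.0 it reduces to a pure classification loss, mirroring the two extremes discussed above.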

Deep Neural Generative Model
As noted above, we propose a new deep neural generative model (DGM), which is an implementation of a generative model using deep neural networks [15]-[17]. More specifically, we implement the aforementioned conditional generative model log p_θ(x | y = k) using two neural networks called an encoder and a decoder, as depicted in Fig. 2. The encoder is given news article x_i and an assumption of stock price movement y, then outputs the parameters of posterior distribution q_φ(z_i | x_i, y) of neutral information z_i, thereby inferring that posterior distribution. The decoder is given the assumption of stock price movement y and a sample z_i from posterior distribution q_φ(z_i | x_i, y), then generates posterior distribution p_θ(x_i | z_i, y) of news article x_i. The encoder and decoder each have u_h hidden layers. The l-th hidden layer consists of n_h^{(l)} units followed by layer normalization [20] and the ReLU activation function [21]. The encoder accepts news article x_i, embedded into an n_x-dimensional vector, at its first hidden layer and the assumption of stock price movement y at its last hidden layer. The output layer of the encoder consists of 2 × n_z units, with half of the units followed by no activation function and used as vector μ_{z_i}, and the other half followed by the exponential function and used as vector σ_{z_i}. Vectors μ_{z_i} and σ_{z_i} are used as the parameters of a multivariate Gaussian distribution with a diagonal covariance matrix that represents posterior distribution q_φ(z_i | x_i, y) of neutral information z_i corresponding to news article x_i. Next, the decoder accepts samples from posterior distribution q_φ(z_i | x_i, y) and the assumption of stock price movement y at its first hidden layer. The output layer of the decoder consists of 2 × n_x units that are used as the parameters of a multivariate Gaussian distribution with a diagonal covariance matrix that represents posterior distribution p_θ(x_i | z_i, y) of news article x_i, in the same way as the encoder.
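To illustrate the shapes involved, the following is a toy numpy forward pass of the encoder with random, untrained weights; the weight initialization and the way y is appended before the last layer are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def layer_norm(a, eps=1e-5):
    # Normalize a single activation vector to zero mean and unit variance.
    return (a - a.mean()) / np.sqrt(a.var() + eps)

def encoder(x, y, n_h=50, n_z=5):
    """Toy encoder pass: article vector x enters the first hidden layer,
    the assumed movement y enters before the output layer, and the output
    yields mu and sigma of q(z|x, y); sigma uses exp to stay positive."""
    W1 = rng.standard_normal((n_h, x.size)) * 0.1
    h = relu(layer_norm(W1 @ x))
    h = np.concatenate([h, [float(y)]])        # condition on y
    W2 = rng.standard_normal((2 * n_z, h.size)) * 0.1
    out = W2 @ h
    mu, sigma = out[:n_z], np.exp(out[n_z:])   # 2 * n_z output units
    z = mu + sigma * rng.standard_normal(n_z)  # reparameterized sample
    return mu, sigma, z

mu, sigma, z = encoder(rng.standard_normal(200), +1)
```

The decoder follows the same pattern in reverse, mapping (z, y) to 2 × n_x output units parameterizing the Gaussian over the article vector.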
Using the above, we sampled latent variable z_i from posterior distribution q_φ(z_i | x_i, y) exactly c = 5 times and calculated the lower bound −L_k of the conditional log-likelihood log p_θ(x_i | y = k) using importance weighted sampling [22]. The encoder and decoder were jointly trained using the Adam optimization algorithm [23] with parameters α = 10^{−4}, β_1 = 0.9, and β_2 = 0.999. We selected hyperparameters from u_h ∈ {1, 2}, n_h^{(l)} ∈ {50, 100, 200, 400}, n_z ∈ {5, 10, 25, 50, 100}, and ω ∈ {0.0, . . . , 0.5, . . . , 1.0}, where n_h^{(l)} ≤ n_h^{(l′)} for l > l′. Our proposed DGM was given a single article at a time and was trained via the objective function in Eq. (3); we call this the single article scenario. We also evaluated a simple preprocessing approach used in previous studies [12], [13], i.e., we calculated vector x̄ as the average of all vectors x = {x_i}, each embedding a news article published on a given day; we call this the averaged article scenario.
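The importance weighted estimate of the lower bound from the c samples can be sketched as a standard log-sum-exp computation, where the per-sample log-weights are assumed to be given:

```python
import numpy as np

def iw_lower_bound(log_w):
    """Importance weighted estimate of the lower bound from c samples
    z_1..z_c ~ q(z|x, y), where log_w[j] = log p(x, z_j | y) - log q(z_j | x, y).
    Computes log((1/c) * sum_j exp(log_w[j])) stably via log-sum-exp."""
    log_w = np.asarray(log_w, dtype=float)
    c = log_w.size
    m = log_w.max()
    return m + np.log(np.exp(log_w - m).sum()) - np.log(c)
```

With c = 1 this reduces to the ordinary single-sample variational bound; larger c tightens the estimate.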

Models for Comparison
As noted previously, our proposed DGM considers binary change y in the stock prices as a sample from prior distribution p(y) rather than as a result of news articles x. For comparison, we also evaluated straightforward classifiers p(y|x) used in previous studies as baselines, in particular multilayer perceptrons (MLPs) [11]- [14] and support vector machines (SVMs) [3], [4], [9], [10], each of which we further describe below.
The MLP that we implemented had u_h hidden layers, each of which consisted of n_h^{(l)} units followed by layer normalization [20] and the ReLU activation function [21], just like our proposed DGM. The MLP had a single output unit followed by the logistic function. For the single article scenario, the output of the MLP represented posterior probability q_φ(y = +1 | x_i) of an increase in the stock price given single news article x_i. The objective function to be minimized was the cross-entropy

E = − Σ_i [ 1(y = +1) log q_φ(y = +1 | x_i) + 1(y = −1) log (1 − q_φ(y = +1 | x_i)) ],

where 1(cond.) is the indicator function that returns 1 if cond. is true and 0 otherwise. The other conditions were the same as those for our proposed DGM.
Once the MLP was trained, we sequentially input the set of news articles x = {x_i} published on a given day and predicted the stock price movement using the summation of the log-probabilities Σ_i log q_φ(y = k | x_i), in the same way as our proposed DGM; Σ_i log q_φ(y = +1 | x_i) > Σ_i log q_φ(y = −1 | x_i) was considered to imply q_φ(y = +1 | x) > q_φ(y = −1 | x), and vice versa. For the averaged article scenario, we trained the MLP using the same approach as that used for the single article, and single output q_φ(y | x̄) was used as the prediction for the given day.
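The day-level aggregation for the single article scenario can be sketched as follows (the function name is hypothetical; per-article probabilities are assumed to lie strictly between 0 and 1):

```python
import numpy as np

def mlp_day_prediction(probs_up):
    """Aggregate per-article MLP outputs q(y=+1|x_i) into a day-level
    prediction by summing log-probabilities, as described above.
    probs_up must lie strictly in (0, 1)."""
    p = np.asarray(probs_up, dtype=float)
    log_up = np.log(p).sum()          # sum_i log q(y=+1|x_i)
    log_down = np.log(1.0 - p).sum()  # sum_i log q(y=-1|x_i)
    return +1 if log_up > log_down else -1
```

For the averaged article scenario the aggregation is unnecessary, since a single averaged vector yields a single output.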
As for the SVM, we only trained the SVM for the averaged article because convergence of training took a very long time for the single article. Further, we selected parameter C by trading off between classification accuracy and margin maximization from C ∈ {. . . , 1, 2, 5, 10, 20, 50, 100, . . . }. Note that we also evaluated the fully generative version of our proposed DGM, i.e., our proposed DGM with a hyperparameter of ω = 0.0.

Datasets
We evaluated our proposed DGM and the other models on two datasets, i.e., Nikkei and S&P 500. The Nikkei dataset includes the Nikkei 225 daily stock index (Nikkei Stock Average) and news articles from the morning edition of the Nihon Keizai Shimbun (Nikkei) newspaper, which is in Japanese. Just as in previous studies [12], [24], we only extracted the titles from these news articles. Titles were preprocessed using the morphological analyzer MeCab [25] and embedded into n_x-dimensional vectors using the Paragraph Vector algorithm [19]. The Paragraph Vector algorithm used a vector length of 200, negative sampling, a window size of 3, and a learning rate that linearly decreased from 0.1 to 0.0001 over 300 iterations. Since we obtained the news articles from the morning edition, we were able to use these news articles to predict the price movements on the same day. An increase y = +1 (or decline y = −1) in the stock index was defined as the situation in which the closing price was higher (or lower) than the corresponding opening price.
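The labeling rule above can be sketched as follows (the handling of unchanged days with a 0 label is an assumption, since the definition covers only strictly higher or lower closes):

```python
def label_movements(opens, closes):
    """Binary target as defined above: y = +1 when the closing price is
    higher than the opening price, y = -1 when it is lower; unchanged
    days receive 0 (no movement label) as an illustrative convention."""
    labels = []
    for o, c in zip(opens, closes):
        if c > o:
            labels.append(+1)
        elif c < o:
            labels.append(-1)
        else:
            labels.append(0)
    return labels
```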
We obtained the S&P 500 dataset from [26]; it includes Standard & Poor's 500 Stock Index and financial web news obtained from both Reuters [27] and Bloomberg [28], each of which is in English. As with the Nikkei dataset, we also embedded the titles to vectors using the Paragraph Vector algorithm. Further, we used these news articles to predict price movements on the next business day.

Prediction Performance
To evaluate the prediction performance for the datasets, each dataset was divided into three subsets for training, validation, and testing. Table 1 summarizes the time period and the number of articles for each of these subsets. The prediction performance of each model was evaluated by measuring prediction accuracy, which is defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Since classes y = +1 and y = −1 differ in terms of size, we also used the Matthews Correlation Coefficient (MCC) to evaluate prediction performance. The MCC is a well-known balanced measure defined as

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
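Both measures can be computed directly from the four confusion counts:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when any marginal
    count is zero, following the usual convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A predictor biased toward one class can still score above 50% accuracy on an imbalanced dataset, but its MCC stays near zero, which is why both measures are reported.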
We trained each of the models using the corresponding training subset and various hyperparameters, where we adjusted the imbalance in the classes and the imbalance in the number of news articles per day via oversampling. The hyperparameters were selected according to the MCC for the validation subset. Once the hyperparameters were selected, we retrained the models using both the training and validation subsets. Table 2 summarizes the selected hyperparameters, with the numbers in square brackets indicating the numbers of units in the encoders of our proposed DGMs or in the MLPs. Tables 3 and 4 summarize prediction performance, with sin. and ave. indicating the single article and averaged article scenarios, respectively. Note that our proposed DGM with a modified objective function of ω = 0.9 achieved the best prediction accuracy for the testing subsets of both datasets in the single article scenario.

Market Simulation
We also evaluated the performance of the models based on whether or not they were capable of making a profit. Following the framework of the previous study [8], we simulated a day trader strategy using each of the models to predict a stock index. More specifically, the day trader creates a portfolio of securities that represents a stock index or considers a financial derivative product that tracks a stock index. If the model predicts an increase in the price of the stock index on a given business day, the simulated day trader purchases a unit of the derivative product at the opening price, holds it during business hours, then sells it at the closing price. If the stock index changes by more than a certain threshold θ during business hours, our simulated day trader immediately sells the derivative product for profit-taking or loss-cutting. Our simulated day trader uses the same strategy for short-selling. We evaluated the capacity to make a profit on the testing subset of each dataset with thresholds θ = 1% and θ = 2%. For comparison, we also examined trivial strategies, including random trading (buying and selling with an equal probability of 0.5), all-buying (buying with a probability of 1.0 and selling with a probability of 0.0), and all-selling (selling with a probability of 1.0 and buying with a probability of 0.0). Table 5 summarizes our results, comparing random trading and all-buying/all-selling with our other models. Note that random trading had a mean of zero because the market simulation did not take into account trading commissions. Further, we added the standard deviations of the profits of the random trading strategy. Our proposed DGM with a modified objective function of ω = 0.9 for the single article showed the largest profit on both datasets, followed by the SVM and the MLP for the averaged article.
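One day of the simulated strategy can be sketched as follows; the exact intraday exit rule (checking the day's high and low against the threshold and exiting exactly at the threshold) is an assumption for illustration:

```python
def simulate_day(pred, open_p, close_p, high, low, theta=0.01):
    """Fractional profit for one day: buy (pred=+1) or short-sell (pred=-1)
    one unit at the open, exit at the close unless the intraday move
    reaches threshold theta, in which case exit at the threshold for
    profit-taking or loss-cutting. Exit rule details are a sketch."""
    up = (high - open_p) / open_p     # largest intraday rise
    down = (low - open_p) / open_p    # largest intraday fall (negative)
    if pred == +1:                    # long position
        if up >= theta:
            return theta              # profit-taking
        if down <= -theta:
            return -theta             # loss-cutting
        return (close_p - open_p) / open_p
    # short position: profit when the price falls
    if down <= -theta:
        return theta
    if up >= theta:
        return -theta
    return (open_p - close_p) / open_p
```

Summing this quantity over the testing period under each model's predictions reproduces the kind of profit comparison reported in Table 5.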

Vector Representation of News Articles
In the previous subsections, we demonstrated that our proposed DGM was able to predict stock price movement y from news articles embedded into vectors x = {x_i} using the Paragraph Vector algorithm [19]. Therefore, our proposed DGM can be described as extracting features (i.e., economic information s) related to stock price movement y from vectors x. In this subsection, we consider an artificial news article x̃_i that describes the same neutral information z_i and the opposite economic information s̃_i = −s_i. The words and meanings that differ between original article x_i and its paired artificial news article x̃_i are considered to be related to economic information s_i and stock price movement y. Given this, we attempted to extract a vector r_i that represents such features from news article x_i as:

r_i = E_{p_θ(x_i | z_i, y = s_i)}[x_i] − E_{p_θ(x_i | z_i, y = −s_i)}[x_i],

i.e., the difference between the decoder means under the original and the opposite assumed price movements. Here, the artificial news article represented by vector x̃_i = x_i − r_i is expected to represent the opposite economic information s̃_i = −s_i. Therefore, we selected several news articles x_i from each testing subset, generated vectors x̃_i = x_i − r_i representing artificial news articles, and retrieved the news articles embedded into the vectors closest to x̃_i in cosine similarity. Table 6 summarizes our results.
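The retrieval step can be sketched as follows, assuming the article embeddings are stored as the rows of a matrix:

```python
import numpy as np

def retrieve_paired(x_i, r_i, corpus):
    """Return the row index of the article embedding closest in cosine
    similarity to the artificial vector x_i - r_i, as in the procedure
    above. corpus holds one article embedding per row."""
    target = x_i - r_i
    corpus = np.asarray(corpus, dtype=float)
    sims = corpus @ target / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(target) + 1e-12)
    return int(np.argmax(sims))
```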

Discussion
Almost every model evaluated in our present study achieved an accuracy close to 100% for the training subset and a good MCC (> 0.2) for the validation subset with the selected hyperparameters. Unfortunately, this does not necessarily indicate that the model predicts stock price movements well. Each model except our proposed DGM is a classifier and is prone to overfitting to the training subset. Such models sometimes predicted the validation subset well by chance but were not generalized to the testing subset. The MLP and our proposed DGM with ω = 0.0 achieved relatively better generalization for the averaged article than for the single article, because the number of input data points was reduced and the models had a lower risk of overfitting for the averaged article. Our proposed DGM with ω = 0.0 modeled a given dataset regardless of the condition y owing to the high flexibility of deep neural networks; this resulted in limited classification accuracy. In contrast, only our proposed DGM with ω > 0.0 demonstrated significant generalization for the single article, showing the effectiveness of our proposed objective function in Eq. (3). Our proposed DGM with ω > 0.0 can be interpreted as a classifier with a penalty term derived from the formulation of a generative model; therefore, it achieved better results than the SVM, the MLP, and our proposed DGM with ω = 0.0.
According to the bottom row of Table 1, stock prices increased for approximately 55% of the days in each subset of the S&P 500 dataset. When the predictions of a model were biased toward stock price increases, the model was able to achieve accuracy better than 50%; however, a model with such biased predictions was not given a good evaluation based on the MCC, since the MCC is a balanced measure. The SVM and the MLP achieved better accuracy but insignificant MCCs for the S&P 500 dataset, implying that their accuracy levels above 50% for that dataset were due to biased predictions. Conversely, the DGM with ω = 0.0 achieved accuracy of almost 50%. We highlight here that the DGM with ω = 0.0 was the only fully generative model in this study; further, its objective was to reconstruct the given news articles x_i, and its predictions were almost balanced because of its prior probability p(y = +1) = p(y = −1) = 0.5. In addition, the MCC scores of the DGM with ω = 0.0 were almost zero, just like those of the SVM and MLP, indicating that the DGM with ω = 0.0 failed in predicting the stock price movements, which is why it achieved accuracy of almost 50%. Finally, we note that since the Nikkei dataset is relatively balanced, the SVM, the MLP, and the DGM with ω = 0.0 all achieved accuracies of almost 50% on the Nikkei dataset.
In the Nikkei dataset with a threshold of θ = 2%, the MLP for the averaged article earned a profit of 16.57%, which is comparable to the 16.63% profit earned by our proposed DGM for the single article, despite the large difference in prediction accuracy between our proposed DGM (i.e., 56.4%) and the MLP (i.e., 50.4%). With a threshold of θ = 1%, the profit earned by the MLP increased only slightly to 16.66%, whereas the profit earned by our proposed DGM increased substantially to 37.06%. All models were trained to predict the binary change y in the stock price independent of the amount of change. Therefore, higher prediction accuracy implies a higher expected profit, but it does not guarantee a higher profit because of the large variance of profits per day. If a model predicts small changes successfully but fails in predicting a wide swing, the model loses a great deal in spite of its 50% prediction accuracy. We observed that this is why several models earned huge profits or suffered huge losses with a threshold of θ = 2% in spite of a prediction accuracy of approximately 50%. Following previous studies [8], [13], our market simulation capped price changes at threshold θ; with a lower threshold θ, the variance of the change in stock price was suppressed and the variance of the profit decreased, so the ordering by profit came to resemble the ordering by prediction accuracy (see also the standard deviation (std.) results in Table 5). Therefore, the profit earned by our proposed DGM with a threshold of θ = 1% is more significant than that earned with a threshold of θ = 2%. The quantitative prediction of stock price movements and market simulations based on such predictions are both beyond the scope of this present study, but we include them as possible future work [29]. A market simulation considering commissions and the time lags inherent in trading is also future work [30].
In Sect. 3.5, starting from the original news article x_i, we found the paired news article closest to vector x̃_i = x_i − r_i in cosine similarity. Paired news article x̃_i is thereby expected to represent a news article describing similar neutral information z_i with the opposite economic information s̃ = −s. The original and paired news articles share words such as "net income," "billion yen," "rising yen" (translated from the Japanese), "revenue," "housing," and "percent." These words can be considered to correspond to neutral information z_i unrelated to stock price movement y. In contrast, the original and paired news articles also include words with opposite meanings, such as "increase" versus "decline" (translated from the Japanese) and "starts up" versus "starts down." These word pairs can be considered to correspond to economic information s directly related to stock price movement y. Both "increase" and "starts up" do not always have positive meanings, but they are positive when used with other positive words, such as "income," "price," and "housing." Therefore, our proposed DGM understands the language model obtained from the Paragraph Vector algorithm [19] and separates features depending on their relation to stock price movement y.
Turning our attention to the viewpoint of graphical models, we can view the MLP as an implicit model in which stock price y is a result of the reactions of investors to news articles x. We can also consider the model shown in Fig. 1 (a), in which the latent variables z and s influence news articles x and stock price y and are inferred from them using an approach similar to that used in supervised topic models [31]. However, this model also easily overfits to the training subset due to the massive number of explanatory variables in spite of the limited explained variable y. Note that this is why we treat stock price movement y as a condition rather than as a result.
Finally, society and financial markets are far more complicated than even our best models. As future work, we plan to consider a hierarchical generative model that generates a set of news articles describing the same event and includes individual companies, individual stock prices, multiple stock markets, and the various relationships among all of these components. Our proposed DGM only predicted stock indices on the following day, but it can arguably be extended to handle temporal dynamics and the delayed, long-term influences of news articles and stock prices. As such, we also include this as part of our future work.

Conclusion
In this study, we propose a deep neural generative model of news articles to predict stock price movements. Our proposed model is given an assumed future price movement as a condition and generates news articles embedded into vectors. We evaluated our proposed model and other models for comparison by applying them to historical datasets of stock indices. Our proposed model achieved the highest prediction accuracy among all of the models and was the most capable of earning a profit in our market simulations. Our proposed model extracts features from news articles embedded into vectors according to their relation to stock price movements.