GCN-based stock relations analysis for stock market prediction

Cheng Zhao; Xiaohui Liu; Jie Zhou; Yuefeng Cen; Xiaomin Yao

doi:10.7717/peerj-cs.1057

GCN-based stock relations analysis for stock market prediction

Cheng Zhao¹, Xiaohui Liu², Jie Zhou¹, Yuefeng Cen³, Xiaomin Yao ⁴

1School of Economics, Zhejiang University of Technology, Hangzhou, Zhejiang, China

2College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, Zhejiang, China

3School of Information and Electronic Engineering, Zhejiang University of Science & Technology, Hangzhou, Zhejiang, China

4College of Entrepreneurship, Zhejiang University of Technology, Hangzhou, Zhejiang, China

DOI: 10.7717/peerj-cs.1057

Published: 2022-08-11
Accepted: 2022-07-14
Received: 2022-03-27

Academic Editor: Qichun Zhang

Subject Areas: Artificial Intelligence, Data Mining and Machine Learning, Graphics
Keywords: Stock prediction, Multi-factor, Stock relation, Time series, Graph-based learning, LSTM

Copyright: © 2022 Zhao et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Zhao C, Liu X, Zhou J, Cen Y, Yao X. 2022. GCN-based stock relations analysis for stock market prediction. PeerJ Computer Science 8:e1057 https://doi.org/10.7717/peerj-cs.1057

Abstract

Most stock price predictive models merely rely on the target stock’s historical information to forecast future prices, where the linkage effects between stocks are neglected. However, a group of prior studies has shown that the leverage of correlations between stocks could significantly improve the predictions. This article proposes a unified time-series relational multi-factor model (TRMF), which composes a self-generating relations (SGR) algorithm that can extract relational features automatically. In addition, the TRMF model integrates stock relations with other multiple dimensional features for the price prediction compared to extant works. Experimental validations are performed on the NYSE and NASDAQ data, where the model is compared with the popular methods such as attention Long Short-Term Memory network (Attn-LSTM), Support Vector Regression (SVR), and multi-factor framework (MF). Results show that compared with these extant methods, our model has a higher expected cumulative return rate and a lower risk of return volatility.

Introduction

According to the statistics of the World Bank, in 2021, the ratio of global stock market value to GDP has exceeded 130% (https://data.worldbank.org.cn/indicator/CM.MKT.LCAP.GD.ZS), which means that the stock market has become one of the most popular investment channels. As defined in the capital asset pricing model (Fama & French, 2004), the returns obtained by an investor in the stock market mainly comprise two parts: alpha returns (the part of individual returns that does not fluctuate with the market) and beta returns (the part of systematic returns that follows market fluctuations). The former depends on a company’s long-term operation, the analysis of which is best performed by humans. In contrast, the latter fluctuates sharply with the market in the short term, which is more suitable for technical analysis using quantitative models. Based on stock price characteristics, quantitative models search for undervalued stocks and profit by their value return.

Traditionally, quantitative models based on deep learning treat the transaction data of a single stock as a time series (Cao, Li & Li, 2019; Bao, Yue & Rao, 2017; Narayan, Phan & Liu, 2021; Greenwood & Shleifer, 2014; Zhang, Ashuri & Deng, 2017; Dai, Shao & Lu, 2013; Kazem et al., 2013). The recurrent neural network (RNN) and its various variants that are good at modeling sequence achieved good performance in stock price prediction (Bao, Yue & Rao, 2017; Cao, Li & Li, 2019; Chen & Ge, 2019; Fischer & Krauss, 2018; Pang et al., 2020; Zhang, Aggarwal & Qi, 2017; Liu, Li & Liu, 2021). However, they treat stocks as separate entities, ignoring their relations and the integrity of the stock market.

In fact, the prices of related stocks significantly impact each other. As shown in Fig. 1A, Facebook and Google belong to the same industry, and their stock prices show the same trend. Therefore, stock relations have been used to enhance the prediction accuracy with the development of graph learning. However, most researchers directly and simply use industry relations provided by third-party platforms, which are insufficient. As shown in Fig. 1B, the stock prices of NXPI and XEL have shown opposite trends for a long time. Because they represent the new energy vehicle industry and the traditional fossil energy industry, respectively, there is a competitive relationship between them. However, third-party platforms cannot provide such complex relationships.

Figure 1: The impact of stock relations on stock prices.
The horizontal axis represents a period of trading time, and the vertical axis represents the normalized stock price.

Download full-size image

DOI: 10.7717/peerjcs.1057/fig-1

The motivation of the study is to solve the problems of incomplete information and insufficient stock relations in the existing stock price prediction methods mentioned above. Therefore, in the paper, we design the algorithm to mine stock relations more effectively and integrate them with multiple information to improve the prediction. The key contributions include: (1) We designed the self-generating relations algorithm (SGR) that uses trading data to generate stock relations. Meanwhile, we modified the traditional graph convolutional neural network (GCN) to model the influence of stock relations on price; (2) We proposed the time series relational multi-factor model (TRMF) model that combines multiple dimension information for stock price prediction; (3) We demonstrated the effectiveness and universality of TRMF with data from the two most developed stock markets, New York Stock Exchange (NYSE) and NASDAQ Stock Market (NASDAQ).

The remainder of this article is organized as follows: the “Related Work” section reviews research on the technical analysis of stock price predictions and stock relations; then the “Method” section details the structure of the TRMF model, especially the SGR algorithm and the weighted graph convolutional neural network (WGCN) module; in the “Experimental Setting” section, experimental data and model parameters are listed; furthermore, the “Results” section presents the back-testing results of the proposed model and its comparison with other models; finally, the “Conclusion” section concludes the effectiveness of the SGR algorithm and the TRMF model on the stock price prediction.

Related Work

There is a linkage effect between the stocks in the market, which has been widely proved in academics (Baruník & Křehlík, 2018; Diebold & Yilmaz, 2014; Ferrer et al., 2018; Ma et al., 2022; Nasreen et al., 2020; So, Chu & Chan, 2021). Among them, Diebold & Yilmaz (2014) proposed several connectedness measures built from pieces of variance decompositions and tracked daily time-varying connectedness of major US financial institutions’ stock return volatilities on the financial crisis of 2007–2008. Ferrer et al. (2018) demonstrated the time and frequency dynamic connectedness among oil prices, stock returns of American clean energy companies and several vital financial variables with the method proposed by Baruník & Křehlík (2018). Nasreen et al. (2020) analysed the dynamics of connectedness between fossil energy prices and stock market returns of clean energy and technology companies. The results demonstrated that the connectedness volatility was transmitted at all frequencies and over the whole experimental period. So, Chu & Chan (2021) studied the impacts of the COVID-19 pandemic on the connectedness of the Hong Kong financial market. Compared to other crises where the network density and clustering can be explained by co-movement with market indices as in normal periods, they found that both network density and clustering were higher in the partial correlation networks during the COVID-19 outbreak. Despite the extensive literature demonstrating the universality of connectedness among stocks, quantitative models do not take them into account.

Traditionally, researchers treat stock price prediction as a time-series problem and solve it with classic RNN models such as LSTM. Existing researches show that LSTM can effectively extract time-series information and has good performance in stock price prediction (Cao, Li & Li, 2019; Chen & Ge, 2019; Fischer & Krauss, 2018; Kim & Won, 2018; Zhang, Aggarwal & Qi, 2017; Bao, Yue & Rao, 2017; Tang et al., 2020; Rezaei, Faaljou & Mansourfar, 2021; Zhao et al., 2021). For example, Fischer & Krauss (2018) made an earlier attempt to use LSTMs for stock selection of S&P 500 constituents based on the time series characteristics. The experimental results found LSTM networks to outperform memory-free classification methods, i.e., a random forest (RAF), a deep neural net (DNN), and a logistic regression classifier (LOG). Cao, Li & Li (2019) proposed two hybrid forecasting models which combine the two kinds of empirical mode decomposition (EMD) with LSTM to improve the accuracy of the stock market prices forecasting. Compared with the single LSTM model, support vector machine (SVM), multi-layer perceptron (MLP), and other hybrid models, the experimental results showed that the proposed models displayed better performance in one-step-ahead forecasting of financial time series. Similarly, Zhang, Aggarwal & Qi (2017) proposed the State Frequency Memory (SFM) recurrent network, which decomposed the hidden states of memory cells into multiple frequency components. Each component models a specific frequency of latent trading patterns. Kim & Won (2018) combined the LSTM model with various generalized autoregressive conditional heteroscedasticity(GARCH)-type models for stock price prediction. They discovered that the hybrid model combining the LSTM model with three GARCH-type models had the best performance in terms of many metrics. Currently, researchers have found that the attention mechanism exhibits excellent performance in long sequence learning (Vaswani et al., 2017; Yu et al., 2021). Therefore, Chen & Ge (2019) explored the attention mechanism in LSTM network-based stock price movement prediction. The experimental results in Hong Kong stock movement prediction demonstrated the effectiveness. However, as mentioned above, most of these traditional solutions do not incorporate stock relations into the predictive models.

Incorporating stock relations into stock price prediction is a relatively new research direction (Feng et al., 2019; Chen, Wei & Huang, 2018; Long et al., 2020). Typically, Feng et al. (2019) obtained the industry classification data of the NASDAQ and NYSE markets in the United States from the website (https://www.nasdaq.com/screening/industries.aspx) to construct a graph and input it to the GCN. The experimental results showed that this method outperformed the traditional LSTM model in backtesting returns. In addition, Chen, Wei & Huang (2018) obtained the stock relations from the Wind. (https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON), and used the relationship transfer method to increase the number of stock relations. The experimental results demonstrated that increasing the number of stock relations can improve the performance of the prediction. Nonetheless, third-party platforms can only provide industry relations, which are not comprehensive enough.

Scarselli et al. (2009) firstly proposed GCN, extending deep neural networks to graph category. Years later, Bruna et al. (2014) performed convolution operations on the frequency domain to capture the local connection patterns in graphs. Based on their method, several works have been done for accelerating (Defferrard, Bresson & Vandergheynst, 2016; Kipf & Welling, 2017). Among them, Kipf & Welling (2017) simplified GCN via local first-order approximation based on spectral graph convolutions and demonstrated that the method outperformed existing methods in experiments on citation network and knowledge graph datasets. GCN is used for single relational networks between entities (Wu et al., 2020; Fout et al., 2017; Zhao et al., 2020). However, there may be multiple relations between two stocks. For example, a machine tool manufacturer A may use the bolts provided by enterprise B. At the same time, A supplies B with machine tools for producing bolts. Then there are two relations between A and B: A is both the downstream shipper of B and the supplier of B. Therefore, the classical GCN needs to be adapted to the characteristics of enterprise relations for stock price prediction.

Method

Acquiring and utilizing stock relations is the critical component of the TRMF model. In this article, our proposed SGR algorithm uses transaction data to obtain stock relations. Meanwhile, the WGCN module models the impact of relationships on the stock price prediction.

Self-generating relation algorithm

We consider determining whether relations exist between two stocks and how to design rules to capture them. Here, we refer to realistic stock relations as entity relations, such as industrial relations and competitive relations. A stock may have entity relations with many other stocks. The combined relations ultimately act on stock pairs’ trading behavior, especially prices in different periods. So that their prices show a specific pattern, such as two stocks’ closing prices rising or falling together or one stock’s five-day average price rising while another falls. Therefore, the correlation between two stock prices reflects a combination of multiple relations between them. There is a mapping: (1) $(R_{i, j}^{1}, R_{i, j}^{2}, \dots, R_{i, j}^{N}) \to_{}^{f} r_{i, j}^{k}$ where R_i,j denotes an entity relation between stock i and stock j, $r_{i, j}^{k}$ denotes the correlation between stock i and stock j in certain trading data. Both are Bool data, represented by 0 and 1.

 
 
 Data: stocks S, trading days Tr, trading features F, minimum support 
              min_support 
    Result:  Stock relation adjacency matrix Asgr[len(S)][len(S)][len(F)] 
  1  Function gen_Stocks_Feature_Event(feature): 
     2   Event_Bit_Set[len(S)][len(Tr)] = {0} 
3   for t in Tr do 
    4   for s in S do 
5   if trading feature of stock s in day t meet feature threshold 
   then 
6   Event_Bit_Set[s][t] = 1 
7   end 
8   end 
9   end 
10   return Event_Bit_Set 
11 
12  for f in F do 
     13   E = gen_Stocks_Feature_Event(S, Tr, f) 
14   for (s1,s2) pair from S do 
    15   dis = Hamming Distance between E[s1] and E[s2] 
16   support = (len(Tr) - dis) / len(Tr) 
17   if support >min_support then 
18   Asgr[s1][s2][f] = 1 
19   else 
20   Asgr[s1][s2][f] = 0 
21   end 
22   end 
23  end 
 
              Algorithm 1: Self-generating relation algorithm

Therefore, we use the correlation of stock prices, namely self-generating relations (SG-relations), as an alternative to entity relations. The SGR algorithm mines stocks’ SG-relations. Its input is multiple trading features F, and the output is a 3D adjacency matrix A_sgr. The flow of the algorithm is represented in pseudocode in Algorithm 1. In short, two stocks have a SG-relation on a trading feature, which means they meet the same threshold on the feature for enough days. The threshold here can be similar to that the closing price is greater than 0, the five-day average price is less than 0, and so on.

Weighted graph convolutional neural network

GCN requires two inputs: an adjacency matrix A representing the relationship, and the other is a matrix H representing stocks’ time-series features. We use a simplified GCN proposed by Kipf & Welling (2017) which is the state-of-the-art formulation (Chen, Wei & Huang, 2018; Feng et al., 2019). It consists of two convolutional layers, one for input-to-hidden and the other for hidden-to-output: (2) $A = D - 1 / 2 (A + I) D - 1 / 2 Y = f (X, A) = s o f t m a x (A (R e L U (A X W 0)) W 1)$

Here, A + I is the adjacency matrix of stock relations undirected graph G with added self-connections, D ∈ ℝ^N×N is the degree matrix of G.W0 ∈ ℝ^C₁×C₃ is an input-to-hidden weight matrix, and W1 ∈ ℝ^C₃×C₄ is the hidden-to-output weight matrix.

Note that A is a 2D adjacency matrix, which means only a single relation between stock nodes. However, two stocks may have multiple SG-relations as defined in the previous section. Therefore, we designed the WGCN to utilize relational information. It adds a weight layer to the traditional GCN to map various SG-relations into a synthetic relation: (3) $A {s g r}^{(N, N, n_{r})} \to W r^{(n_{r}, 1)} A (N, N) A = W r A s g r$ where A_sgr denotes the adjacency matrix of multiple SG-relations, and W_r denotes the weight vector.

Table B1:

Evaluation metrics.

Here, o_t is open price of day t, p is the principal. r_up is a local high point of return ratio and r_down is a local low point. r_p is return of portfolio, r_f is risk-free rate, and δ_p is standard deviation of the portfolio’s excess return.

Metrics	Formula
IRR	$\sum_{t = 1}^{n} \frac{o_{t} - o_{t - 1}}{p}$
MDD	$max (\frac{r_{u p} - r_{d o w n}}{r_{u p}})$
SR	$\frac{r_{p} - r_{f}}{δ_{p}}$

DOI: 10.7717/peerjcs.1057/table-B1

Figure 2 illustrates mapping the entity relations to a synthetic relation. Suppose that A is a traditional auto company actively turning to new energy, and B is a power battery company. They used to have mutually exclusive competition and are gradually increasing cooperation. These complex entity relationships are reflected in market transactions, which may be the reverse trend of the 60-day average price because it reflects longer periods. It may also show the same trend of the daily trading price because it reflects the latest situation. These SG-relations are combined by weights to form a synthetic relation that is fed into the GCN.

Figure 2: Synthetic relations generation.
SD, SS, and RS are short for “daily price in the same trend”, “60-day average price in the same trend”, and “60-day average price in the reverse trend”, respectively. The complex relationship between companies A and B causes their closing prices to have the same trend, while the 60-day average price has the reverse trend. Therefore, their intersection on adjacency matrices SD and RS is 1, and on RD is 0.

Download full-size image

DOI: 10.7717/peerjcs.1057/fig-2

Framework in whole

The TRMF model comprises feature selection, information extraction, and target prediction. Figure 3 shows the framework in whole.

Feature selection

We first determine three elements: stocks, experimental time, and trading features to get the original data and then filter it. Drawing on the practice of Feng et al. (2019), we perform two conditions to filter the stocks in the market: during the experimental period, (1) the actual trading day percentage greater than 98%; and (2) stock price not less than 15 dollars. The first condition concerns that too many suspension days could damage the statistical characteristics of the data and that the model can learn abnormal patterns. The second condition is to avoid penny stocks (https://www.sec.gov/fast-answers/answerspennyhtm.html) which are risky for general investors, as suggested by the U.S. Securities and Exchange Commission. For the suspension day, the stock trading data maintained the previous day’s record.

Information extraction

We divide the information reflecting stocks into three categories: technical factors, time-series information, and relationship impact factors. Among them, the technical factor shows stocks’ financial statistics, the time series information captures the historical sequence pattern, and the relational factor analyzes the impact of the market environment on the stock. The TRMF model combines them to predict stock prices.

Technical Factor: We pick up five mature and common technical factors: Moving Average (MA), Exponential Moving Average (EMA), Relative Strength Index (RSI), Chande Momentum Oscillator (CMO), On balance volume(OBV) (Fama & French, 2015; Nazemi, Heidenreich & Fabozzi, 2018; Wang, Zhuang & Feng, 2022).

Time Series Information: We use an Attn-LSTM to extract the stocks’ time-series information. LSTM is an excellent variant of the RNN, which uses a four-layer neural network in a recurrent unit to solve the problem of gradient disappearance and explosion during long-sequence training. Therefore, LSTM can outperform an ordinary RNN in longer sequences. It can be represented in short as: (4) $h t i = L S T M (x t i, h t - 1 i),$ where $x_{i}^{t}$ and $h_{i}^{t}$ denote stock features and hidden layer states of stock i on t day.

In addition, specific time points often relate to stock price movements, such as when enterprises release their quarterly financial reports or launch new products. Therefore, we use the enhanced LSTM model with attention. The attention mechanism assigns different weights to time series so that the model can pay attention to a particular critical trading day. (5) $u t i = tanh (W a h t i)$ (6) $α t, t^{'} i = \frac{exp (u i t ⊺ u i t^{'})}{\sum_{j = 1}^{T} exp (u i t ⊺ u i j)}$ (7) $\tilde{h t} i = \sum_{t' = 1}^{T} α t, t' i ⊺ u t' i$

where T is the length of time series and W_a ∈ℝ^C₂×C₁ is a randomly initialized matrix, optimized during training. $\tilde{h}$ $_{i}^{t}$ is the hidden layer state of Attn-LSTM, which describes the time-series features of stock i.

Relationship Impact Factor: The critical component of using the relationship to predict stock prices is acquiring and utilizing it. As described above, we used the SGR algorithm to mine stock relations and WGCN to obtain relationship impact factors.

Target prediction

Finally, all three kinds of information were spliced together and inputted into the fully connected layer. (8) $z^{i} = R e L U (w {[fi : {\tilde{h}}^{i} : yi]}^{⊺} + b)$

where fi is a vector composed of technical factors, yi is the relational impact factor from the WGCN model. zⁱ is the expected return of stock i on the next day, expressed here as o_t/o_t+1.o_t is the opening price of t day.

Experimental Setting

Experimental data

Experiments data is from the New York Stock Exchange (NYSE) and the Nasdaq Stock Exchange (NASDAQ). Because they are the top two stock markets in terms of global market capitalization in 2021 (https://data.worldbank.org/indicator/CM.MKT.LCAP.CD), and the most developed markets in the financial industry in the world. Then the constituent stocks of the S&P 500 are chosen for stock selection. This method, on the one hand, reduces the amount of calculation. On the other hand, these stocks contain 80% of the US stock market value (https://en.wikipedia.org/wiki/S%26P_500), so they can represent the entire market.

The experimental period was 01/02/2018–06/28/2019 and was divided into three intervals: the training (01/02/2018–12/28/2018), validation (01/02/2019–03/29/2019), and test (04/01/2019–06/28/2019) sets. We made this choice because S&P 500 Index shows periods of volatility, plummeting, and soaring in the test set, which is representative. Table 1 lists the number of trading days in the NYSE and NASDAQ.

Table 1:

Experimental date.

Market	Time range	Train	Validation	Test
	01/02/2018	01/02/2018	01/02/2019	04/01/2019
	06/28/2019	12/28/2018	03/29/2019	06/28/2019
NYSE	349	235	56	58
NASDAQ	349	235	56	58

DOI: 10.7717/peerjcs.1057/table-1

We screened 134 stocks on the NASDAQ and 325 on the NYSE during the experimental period based on the two filters mentioned above. Table 2 shows the number of stocks used in the experiment and Appendix lists all stocks.

Table 2:

Statistics of filtered stocks.

Index	Market	All	Condition 1	Condition 2	Final
S&P 500	NYSE	355	13	17	325
	NASDAQ	150	7	9	134

DOI: 10.7717/peerjcs.1057/table-2

We collected the opening price, closing price, highest price, lowest price, trading volume, and turnover rate as stock features. Because they are typical and common, and any stock exchange will provide them. Furthermore, all data were normalized to [0,1] using Min-Max normalization.

We selected four trading features: closing price, 5-day moving average, 20-day moving average, and 60-day moving average as input of SGR algorithm to generate SG-relations. They represent the stock’s trading behavior on the day, week, month, and quarter, respectively. Each trading feature has two different forms of correlation: the same trend and the reverse trend. So, there is eight SG-relations in total. Table 3 list all SG-relations.

Table 3:

SG-relations.

SG-Relation	Event	Minimum support
SD	the closing price of stock A and stock B changed in the same trend	0.6
RD	the closing price of stock A and stock B changed in the reverse trend	0.6
SW	the 5-day average price of stock A and stock B changed in the same trend	0.6
RW	the 5-day average price of stock A and stock B changed in the reverse trend	0.6
SM	the 20-day average price of stock A and stock B changed in the same trend	0.6
RM	the 20-day average price of stock A and stock B changed in the reverse trend	0.6
SS	the 60-day average price of stock A and stock B changed in the same trend	0.6
RS	the 60-day average price of stock A and stock B changed in the reverse trend	0.6

DOI: 10.7717/peerjcs.1057/table-3

Trading strategy

We used a daily cycle “buy-hold-sell” trading strategy to simulate market investments, which has been commonly used in many articles (Feng et al., 2019; Chen & Ge, 2019; Fischer & Krauss, 2018). On each t + 1 day of the test time set, at the opening moment, we simulated a trader selling out all stocks held at the opening price (buy on t day) and then buying the stock with the highest expected revenue.

Transaction rates are ignored (Feng et al., 2019).

Evaluation metrics

Our purpose was to predict stock prices accurately and balance the return and risk of stock investment. So we employed the following four metrics to evaluate and compare the models: mean square error (MSE), cumulative investment return ratio (IRR), maximum drawdown (MDD), and Sharpe ratio(SR) (Wang, Zhuang & Feng, 2022; Feng et al., 2019; Deng et al., 2016). The detailed formulas are presented in the Table B1 in the Appendix B.

MSE has been a standard evaluation metric for regression tasks in machine learning. Since directly reflecting the effect of stock investment, IRR is our primary metric. It is calculated by summing the return ratios of the selected stock on each testing day. MDD describes the worst possible situation in the investment process, which significantly affects investors’ pessimism. It is a significant risk metric. Finally, SR comprehensively evaluates the performance of an investment behavior from the two dimensions of return and risk and is familiar to investors. It should be noted that the r_f and δ_p in the SR formula are very sensitive to the calculation frequency. This study performs the calculation daily since back-testing uses the daily “buy-hold-sell” strategy. Smaller MSE and MDD values and larger IRR and SR values indicate better performance.

Experimental models & parameter settings

Table 4 shows all compared models in back-testing experiments. It is worth emphasizing that TGC is the main one because it incorporates industry relations from third-party platform into stock predictions and is the state-of-the-art stock relation-based solution. Moreover, its source code and data are open access (Feng et al., 2019).

Table 4:

Experimental models.

Model	Description
Five-Factors	Traditional multi-factor model
SVR	Multi-factor model based on machine learning
Attn-LSTM	Make use of time series information
TGC	Integrate time series information and industry relations
WGCN	Make use of SG-relations
TRMF	Integrate technical factors, time series information and SG-relations
Baseline	DJIA in NYSE and IXIC in NASDAQ

DOI: 10.7717/peerjcs.1057/table-4

We draw on previous literature to select the hyperparameters for existing models (Dai, Shao & Lu, 2013; Chen & Ge, 2019; Feng et al., 2019; Liu, Liu & Zhang, 2022). In particular, for SVR, Attn-LSTM and TGC, we follow the original setting in Dai, Shao & Lu (2013); Chen & Ge (2019) and Feng et al. (2019), respectively. For all deep learning methods, we apply the Adam optimizer with a learning rate of 0.001. Furthermore, We employ grid search to select the optimal hyperparameters regarding IRR for other models. We tune two hyperparameters for Attn-LSTM layer in TGC, the length of sequential input T_seq and hidden units C₁ within {5, 10, 20, 60} and {32, 64, 128}, respectively. Besides T_seq and C₁, we further tune the length of trading days T_r in SGR layer. Specifically, we tune T_r within {20, 60, 120, 240}. We futher tune C₃ and C₄ in WGCN layer within {64, 128, 256} and {64, 128, 256}, respectively.

Finally, the specified parameters of our trained TRMF model are as follows:

Input layer with T_seq = 20, which corresponds to a trading month.
LSTM layer with C₁ = 64 hidden neurons.
SGR layer with T_r = 240, which corresponds to a trading year.
WGCN layer with C₃ = 128 input-to-hidden neurons and C₄ = 128 hidden-to-output neurons.
Output layer with one neuron and ReLU activation function.

Results

Our study aims to generate and use more effective stock relations and integrate them for stock market forecasting. Therefore, we conducted experiments to answer two research questions: First, can our proposed SGR algorithm effectively extract relation information among stocks? Second, can the proposed TRMF model effectively integrate multi-dimension information and predict the stock market better than previous researches in both return and risk?

Study of SG-relations

We use the proposed SGR algorithm to mine SG-relations in the NASDAQ and NYSE markets. Then, an experiment will compare SG-relations with the industry relations. Industry relations are extracted from sector-industry data maintained by NASDAQ Inc (https://www.nasdaq.com/screening/industries.aspx). Feng et al. (2019) used them in their TGC model and presented in its appendix.

Figure 4 qualitatively presents the coverage of SG-relations and industry relations via the distribution of intersection colors. Firstly, we can see that the number of red points representing coverage far exceeds the number of blue points representing non-coverage, so SG-relations can largely cover industry relations. Furthermore, a large number of black points are scattered across the graph, which indicates that the SGR algorithm is able to expand the industry relations.

Figure 4: SG-relations and industry relations in NYSE & NASDAQ.
The horizontal axis is the index of 50 stocks, and so is the vertical axis. The color of the intersection indicates whether there is a relationship between the two stocks. The white point indicates that two stocks have neither a SG-relation nor a industry relation. In contrast, the red point indicates that both relations exist, and the black point with only SG-relation while the blue point with only industry relation. The figure just contains top-50 stocks because the number of stocks in each market is different and too large.

Download full-size image

DOI: 10.7717/peerjcs.1057/fig-4

Figure 5 shows examples of SG-relations. Among them, we can see that SG-relations can cover many industry relations (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), such as FB.O & GOOG.O, AEP.O & LNT.O, and APA.O & FANG.O. Furthermore, there are several non-industry relations in SG-relations. For example, NVDA.O is the supplier of AAPL.O. Their 60-day average prices show the same trend, yet AAPL.O slightly lags behind NADA.O. In addition, NXPI.O has an exclusive relation with XEL.O because the former represents the new energy vehicle industry while the latter represents the traditional fossil energy industry. Their 20-day average prices fluctuate reversely. Finally, EBAY.O’s 60-day average price shows an apparent reverse trend with EXC.O’s for a long time while we cannot figure it out by industry analysis. There may be some internal connection between them. This information can help predict their price more accurately.

Table 5 quantitatively shows coverages of SG-relations with the non-zero number to industry relations. We can see that the SG-relation can largely cover the industry relations provided by the third-party platform. Among them, SG-relation can cover more than 60% of industry relations in the NYSE market, and even more than 80% in the NASDAQ market. On the other hand, the SGR algorithm adds many stock relations that the third-party platforms fail to provide. In both the NYSE and NASDAQ markets, industry relations contain no more than 5% of SG-relations. Although these stock pairs with SG-relation are not in the same industry, there is a strong correlation between their prices in different periods. These relations can be used as important information for stock price prediction. SG-relations generated by different trading data have different coverage rates for industry relations. Of these, the SG-relation corresponding to the daily price has the highest coverage, and the coverage of other trading data decreases with the growth of the period. Figure 6 shows the rule more intuitively. This phenomenon reveals that the short-term prices of companies in the same industry are correlated, while the long-term price correlation can reflect other relationships.

Table 5:

SG-relations and industry relations cover each other on NYSE and NASDAQ. There are only SG-relations with a non-zero number. Thus, RD and RW are removed.

	NYSE		NADSAQ
SG-Relations	Number	Coverage	Number	Coverage
SD	8,075	0.444	2,010	0.622
SW	4,701	0.293	1,148	0.403
SM	2,111	0.16	618	0.214
RM	12	0	3	0.005
SS	1,656	0.09	773	0.179
RS	247	0	56	0.005
Total	13,381	0.624	3,305	0.816
industry relations	721	0.034	201	0.05

DOI: 10.7717/peerjcs.1057/table-5

Figure 6: Coverage ratio of SG-relations to industry relations.
The labels in the vertical ordinate - D, W, M, and S denote day, week, month, and season period, respectively. The same trend label in D refers to SD relation, and so on.

Download full-size image

DOI: 10.7717/peerjcs.1057/fig-6

Study on back-testing return of TRMF

Figure 7 presents the cumulative rate of return of all models. Their performance follows the order of TRMF > TGC > Attn-LSTM > WGCN > SVR > Baseline > Five-Factors. And most of the time, TRMF is on top. Therefore, TRMF can bring maximum benefits to investors in the experimental environment.

Figure 7: Comparison of cumulative rate of return among models.

Download full-size image

DOI: 10.7717/peerjcs.1057/fig-7

Table 6 gives the results of each experimental model under four evaluation metrics. We can see that TRMF model outperforms comparison models on most metrics. First, we found both TRMF and TGC outperforms traditional Five-Factors model, SVR, and Attn-LSTM on two markets with great performance w.r.t. IRR, MDD, and SR. It varifies that incorporating stock relation information into the model can significantly improve the stock price prediction. In terms of MDD, the prediction of both TRMF and TGC are significantly smaller than those of other comparable models and benchmark indices. Both them integrate stock relations into the prediction. Therefore, we hypothesized that the relation category factor could effectively reduce the drawdown because the pricing of related stocks gives investors confidence and reduces the overselling situation because of short-term market sentiment. Furthermore, TRMF’s performance was better than the TGC using industry relations, which may be related to the higher number of SG-relations. This indicates that our proposed SG-relations can capture more relation information to enhance the robustness of model predictions. However, TRMF failed to reach the best performance w.r.t. MSE, though it was still the second of all models. The reason could be because TRMF has the most parameters. The WGCN, which only uses relation information, is inferior to the Attn-LSTM which only uses time information in terms of IRR and SR, and it cannot achieve high prediction accuracy with respect to MSE. Therefore, the time series information of stocks is more important, and the stock prices of different stocks retain a large degree of autonomy.

Table 6:

Performance comparison among models.

	NYSE				NADSAQ
Model	MSE	IRR	MDD	SR	MSE	IRR	MDD	SR
Five-Factors	7.33e−4	0.014	8.2%	0.011	8.27e−4	0.001	10.3%	0.001
SVR	5.21e−4	0.101	5.9%	0.255	6.32e−4	0.109	7.2%	0.166
Attn-LSTM	2.99e−4	0.214	5.2%	0.237	4.63e−4	0.131	6.5%	0.326
WGCN	3.15e−4	0.172	5.9%	0.172	4.87e−4	0.117	24.9%	0.153
TGC	3.13e−4	0.271	4.6%	0.345	4.37e−4	0.282	4.7%	0.335
TRMF	3.12e−4	0.385	3.2%	0.544	4.41e−4	0.334	4.4%	0.492
Baseline	7.33e−4	0.014	8.2%	0.011	8.27e−4	0.001	10.3%	0.001

DOI: 10.7717/peerjcs.1057/table-6

Conclusions

A stock’s future price is affected by its historical trading data and the related stocks. Therefore, in this article, we expected to combine stock relations with multiple information for the price prediction. To solve the problem, we proposed a TRMF model. The core of the TRMF model is an algorithm for richly generating stock relations, named SGR algorithm, which uses trading data to mine multiple SG-relations automatically. Experiments based on data from the NYSE and NASDAQ demonstrated the effectiveness of our solution. The results show that SG-relations can cover more than 60% of industry relations from the third-party platform. TRMF outperformed SVR, Attn-LSTM, and TGC with higher IRR, SR and lower MDD.

We proved that the correlations of price fluctuations in different periods reflect stock relations. But other transaction dimensions, such as trading volume, turnover, and others, are not discussed in the paper. Therefore, in the future, we will investigate the potential of extending the self-generating relation algorithm to other stock trading data to gain more effective relationship information and improve stock price predictions.

Supplemental Information

Source code of TRMF model with exprimental dataset

Folders Data, Model and ZBacktest represent experimental data, model and backtesting experimental process respectively. Hyperparameters are collected in the A_Global file.

DOI: 10.7717/peerj-cs.1057/supp-1

Download

[1] Bao W, Yue J, Rao Y. 2017. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLOS ONE 12(7):e0180944

[2] Baruník J, Křehlík T. 2018. Measuring the frequency dynamics of financial connectedness and systemic risk*. Journal of Financial Econometrics 16(2):271-296

[3] Bruna J, Zaremba W, Szlam A, LeCun Y. 2014. Spectral networks and locally connected networks on graphs. In: 2nd international conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, conference track proceedings.

[4] Cao J, Li Z, Li J. 2019. Financial time series forecasting model based on CEEMDAN and LSTM. Physica a-Statistical Mechanics and its Applications 519:127-139

[5] Chen S, Ge L. 2019. Exploring the attention mechanism in LSTM-based hong kong stock price movement prediction. Quantitative Finance 19(9):1507-1515

[6] Chen Y, Wei Z, Huang X. 2018. Incorporating corporation relationship via graph convolutional neural networks for stock price prediction. In: Proceedings of the 27th ACM international conference on information and knowledge management, CIKM ’18. New York, NY, USA. Association for Computing Machinery. 1655-1658

[7] Dai W, Shao YE, Lu C-J. 2013. Incorporating feature selection method into support vector regression for stock index forecasting. Neural Computing & Applications 23(6):1551-1561

[8] Defferrard M, Bresson X, Vandergheynst P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems, vol. 29. Curran Associates, Inc.

[9] Deng W, Wang G, Zhang X, Xu J, Li G. 2016. A multi-granularity combined prediction model based on fuzzy trend forecasting and particle swarm techniques. Neurocomputing 173:1671-1682

[10] Diebold FX, Yilmaz K. 2014. On the network topology of variance decompositions: measuring the connectedness of financial firms. Journal of Econometrics 182(1):119-134

[11] Fama EF, French KR. 2004. The capital asset pricing model: theory and evidence. Journal of Economic Perspectives 18(7):25-46

[12] Fama EF, French KR. 2015. A five-factor asset pricing model. Journal of Financial Economics 116(1):1-22

[13] Feng F, He X, Wang X, Luo C, Liu Y, Chua T-S. 2019. Temporal relational ranking for stock prediction. ACM Transactions on Information Systems 37(2):27:1-27:30

[14] Ferrer R, Hussain Shahzad SJ, Lopez R, Jareno F. 2018. Time and frequency dynamics of connectedness between renewable energy stocks and crude oil prices. Energy Economics 76:1-20

[15] Fischer T, Krauss C. 2018. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270(2):654-669

[16] Fout A, Byrd J, Shariat B, Ben-Hur A. 2017. Protein interface prediction using graph convolutional networks. In: Advances in neural information processing systems, volume 30. Curran Associates, Inc.

[17] Greenwood R, Shleifer A. 2014. Expectations of returns and expected returns. Review of Financial Studies 27(3):714-746

[18] Kazem A, Sharifi E, Hussain FK, Saberi M, Hussain OK. 2013. Support vector regression with chaos-based firefly algorithm for stock market price forecasting. Applied Soft Computing 13(2):947-958

[19] Kim HY, Won CH. 2018. Forecasting the volatility of stock price index: a hybrid model integrating LSTM with multiple GARCH-type models. Expert Systems with Applications 103:25-37

[20] Kipf TN, Welling M. 2017. Semi-supervised classification with graph convolutional networks. In: ICLR.

[21] Liu Z, Li Y, Liu H. 2021. Fuzzy time-series prediction model based on text features and network features. Neural Computing and Applications Epub ahead of print 2021 10 March

[22] Liu J, Liu Y, Zhang Q. 2022. A weight initialization method based on neural network with asymmetric activation function. Neurocomputing 483:171-182

[23] Long J, Chen Z, He W, Wu T, Ren J. 2020. An integrated framework of deep learning and knowledge graph for prediction of stock price trend: an application in chinese stock exchange market. Applied Soft Computing 91:106205

[24] Ma C, Liang Y, Wang S, Lu S. 2022. Stock linkage prediction based on optimized LSTM model. Multimedia Tools and Applications 81(9):12599-12617

[25] Narayan PK, Phan DHB, Liu G. 2021. COVID-19 lockdowns, stimulus packages, travel bans, and stock returns. Finance Research Letters 38:101732

[26] Nasreen S, Tiwari AK, Eizaguirre JC, Wohar ME. 2020. Dynamic connectedness between oil prices and stock returns of clean energy and technology companies. Journal of Cleaner Production 260:121015

[27] Nazemi A, Heidenreich K, Fabozzi FJ. 2018. Improving corporate bond recovery rate prediction using multi-factor support vector regressions. European Journal of Operational Research 271(2):664-675

[28] Pang X, Zhou Y, Wang P, Lin W, Chang V. 2020. An innovative neural network approach for stock market prediction. Journal of Supercomputing 76(3):2098-2118

[29] Rezaei H, Faaljou H, Mansourfar G. 2021. Stock price prediction using deep learning and frequency decomposition. Expert Systems with Applications 169:114332

[30] Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61-80

[31] So MKP, Chu AMY, Chan TWC. 2021. Impacts of the COVID-19 pandemic on financial market connectedness. Finance Research Letters 38:101864

[32] Tang X, Zhang Q, Dai X, Zou Y. 2020. Neural membrane mutual coupling characterisation using entropy-based iterative learning identification. IEEE Access 8:205231-205243

[33] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. 2017. Attention is all you need. In: Advances in neural information processing systems, volume 30. Curran Associates, Inc..

[34] Wang J, Zhuang Z, Feng L. 2022. Intelligent optimization based multi-factor deep learning stock selection model and quantitative trading strategy. Mathematics 10(4):566

[35] Wu Y, Lian D, Xu Y, Wu L, Chen E. 2020. Graph convolutional networks with markov random field reasoning for social spammer detection. In: Proceedings of the AAAI conference on artificial intelligence, volume 34(01). 1054-1061