A multi-channel cross-residual deep learning framework for news-oriented stock movement prediction

Abstract
Stock market movement prediction remains challenging due to random walk characteristics. Yet through a potent blend of input parameters, a prediction model can learn sequential features more intelligently. In this paper, a multi-channel news-oriented prediction system is developed to capture intricate moving patterns of the stock market index. Specifically, the system adopts the temporal causal convolution to process historical index values due to its capability in learning long-term dependencies. Concurrently, it employs the Transformer Encoder for qualitative information extraction from financial news headlines and corresponding preview texts. A notable feature of our multi-channel system is the integration of cross-residual learning between different channels, thereby allowing an earlier and closer information fusion. The proposed architecture is validated to be more effective in trend forecasting than independent learning, in which channels are trained separately. Furthermore, we demonstrate the effectiveness of involving news content previews, which improves the prediction accuracy by as much as 3.39%.


Introduction
A time series exhibits a wide variety of features whose historical properties and future evolution are worth investigating (Mills, 2019). Time series prediction is perceived to be an essential technique in many real-world applications. Among them, financial markets have been the focus of much research to date given their importance in facilitating economic development (Hsu et al., 2016). Accurate forecasts support individuals and organisations in making rational investment decisions and in managing financial and operational risks efficiently. Yet, the prediction task remains challenging due to the dynamic, nonlinear, nonstationary, and nonparametric properties of the financial market (Abu-Mostafa & Atiya, 1996). To capture this volatility, researchers in many disciplines focus on developing intelligent techniques for effective financial market forecasting, in particular for the stock market.
Early approaches to stock market behaviour prediction were primarily based on statistical techniques that assume linearity, stationarity, and normality (Shah et al., 2019). These conventional methods, such as the autoregressive integrated moving average (ARIMA), work for a particular time series but fail to capture nonlinear characteristics in general (Zhang, 2003). The emergence of artificial intelligence helps overcome this limitation, and recent research explores deep neural network models in various financial and economic applications, aiming to reach their full potential (Li, Xu, et al., 2020; Liu et al., 2021; Long et al., 2019; Zheng et al., 2021).
Although such deep learning (DL) approaches perform well in terms of prediction accuracy, they, to some extent, ignore the fact that the market also reacts arbitrarily to unexpected external events. Stock market forecasting based solely on quantitative financial indicators can be vulnerable to economic and social changes. There has been an increasing trend towards integrating heterogeneous sources of information to improve prediction performance. Such data sources, e.g., financial news and social media feeds, can contribute valuable information that is not reflected in the intrinsic historical market values.

Literature review
The most extensively used DL model architectures dealing with financial time series are the recurrent neural network (RNN) and the convolutional neural network (CNN). The RNN, mainly in the form of its variants, has been dominantly adopted in stock market prediction due to its powerful capability in sequence processing (Ghosh et al., 2022; Kim & Kang, 2019). It forms an information loop to explore temporal dynamic patterns, allowing learning from previously received input features. Inspired by the structure of the visual cortex, the CNN has exhibited its strength in feature extraction. Compared to the RNN, the CNN benefits from a lower number of parameters by filtering with the kernel window function. One way to apply CNNs to financial forecasting is to convert the selected predictors into a time-series graph (Sim et al., 2019). Besides, the prediction task can also be implemented with simple and compact 1D CNNs, realising a lower implementation cost than feeding two-dimensional image data (Cavalli & Amoretti, 2021).
Yet, Dingli and Fournier (2017) indicated that state-of-the-art classifiers (e.g., LR, SVM) yield slightly better outputs than vanilla 1D CNNs, which thus demand further configuration to surpass the techniques above. For example, since the output at a specific time is conditioned only on the past input samples in financial prediction tasks, a temporal convolution network (TCN) can be a suitable candidate. TCNs introduce dilated causal convolutions (DCC) and residual connections to a simple 1D CNN to maintain such causality. Based on an empirical evaluation (Bai et al., 2018), a TCN is substantially superior to LSTMs and GRUs in capturing long-term dependencies. It also presents longer effective memory and higher execution speed (Borovykh et al., 2019; Wiese et al., 2020).
More than half of the recent DL implementations in the financial domain have focussed on the stock price or the stock market index (Sezer et al., 2020). In addition to its historical patterns, the stock market is also highly influenced by certain public events, which can be extracted from online news articles. Table 1 lists the implementation details of recent stock price (index) forecasting works using financial news data. In Table 1, Period refers to the training and testing period, and Variables lists the different input features that are fed into the model. Lag denotes the time length of the input vector, and Horizon denotes the time length of the output vector to be predicted.
As illustrated in Table 1, news-oriented stock market forecasting has been a popular research area. The RNN and its variants, including LSTM and GRU, are still dominant for time series data processing. Hybrid architectures that combine the strengths of different models are also attractive. Most studies prefer a 2- to 5-year research period due to the high computational cost of collecting online news articles over an extended period. In this study, the input data length is around five years, sufficient for producing satisfactory results. Regarding the qualitative datasets, the news from Reuters and Bloomberg released by Ding et al. (2015) is a popular dataset but is no longer publicly available for research purposes. As a result, news data collection relies on requesting data from legitimate agencies or on web scraping. In addition, most of the literature utilised news headlines as the sole textual data source. It is believed that highly concentrated titles can convey the most indicative information, and incorporating news content risks adding irrelevancies. Although price prediction is essentially a regression problem, predicting the correct directional movement is perceived as the more crucial task. In this sense, trend prediction becomes a classification problem, taking upward and downward price movements into account.

Contribution
When designing DL prediction models, three factors play an essential role in model accuracy, namely network design, input features, and the quantity of training data (Walczak, 2001). Most of the innovations for deep prediction models are embodied in the network design. Considering CNNs as a natural starting point for sequence modelling, this paper develops a novel DL architecture, namely a multi-channel cross-residual TCN (mc-CRTCN), to forecast stock market behaviour. The proposed prediction system incorporates stock market data and financial news to handle unexpected abrupt changes or unexplainable extreme fluctuations. The representation of both news headlines and abstracts is obtained via the Transformer Encoder (TE).
A few studies (Liu et al., 2019; Tang et al., 2018) have shown the superiority of the TE over CNN- and RNN-based models for deep semantic feature extraction. Given multimodal information, a vital issue is how the modalities are processed and fused. Existing knowledge- or event-driven stock market prediction has already utilised multi-channel systems, with one channel responsible for processing stock price data and others dealing with external stock-related tasks (Chung & Shin, 2020; Nti et al., 2021; Zhang et al., 2018). Generally, these related channels are trained simultaneously but separately and then concatenated at the end for information fusion. This paper presents a different fusion method that jointly trains these channels through cross-connections, by which a stronger coupling of multimodal information is realised at an earlier stage. The cross-residual learning (CRL) technique, pioneered by Jou and Chang (2016), has been integrated in visual recognition and pattern classification tasks (Lyu et al., 2020). According to these existing applications, CRL is regarded as an in-network regularisation that provides greater generalisation ability.
Since an accurate prediction of the exact stock index value can be highly challenging, a classification problem is formulated instead, aiming at predicting the movement direction for the next trading day according to relevant information extracted from financial market news. Our contributions can be summarised as:
1. A hybrid configuration of the baseline model is presented using TCN and TE. It effectively enhances the performance of vanilla 1D CNNs in stock market prediction.
2. Both sentence- and paragraph-level features are extracted via TE to capture indicative information from publicly available news data. The influence of adding content previews or descriptions to the news input channel is evaluated in the comparison analysis.
3. A CRL procedure is implemented through cross-connections between multiple channels. The primary motivation is to closely couple multimodal information at an earlier stage to reveal cross-task dependencies. A modification to the original CRL is proposed for stock market forecasting.
The remainder of this study is organised as follows. Section 2 introduces the relevant preliminaries. The proposed prediction strategy based on mc-CRTCN is elaborated in Section 3. Further, Section 4 describes the data collected for the prediction model and presents the numerical experimental results and their implications. Finally, Section 5 concludes this study and suggests several future research directions.

Temporal convolution network
Using multiple DCC layers, TCN differentiates itself from standard 1D CNNs when handling long sequences. Causal convolution was initially proposed as the main ingredient of WaveNet (Oord et al., 2016), attempting to avoid the involvement of future information. In other words, it ensures that the channel considers only historical index values, as required in financial time series prediction. For a convolution operator to be causal, it must satisfy the causality described as:

$$p(X) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) \quad (1)$$

where $p(X)$ represents the likelihood given the observed past index values and $T$ denotes the total length of the observation period. Financial market forecasting normally desires a full coverage of historical patterns to improve model robustness and accuracy. This necessitates a deeper network or a larger filter (also kernel) size, neither of which is feasible for prediction with very long sequences. The use of dilated convolution enables TCN to efficiently address this problem without adding computational complexity. The dilation rate $d$ is introduced to represent the spacing between the skipped input values in each convolutional layer. Since $d$ doubles with each successive layer, TCN eventually obtains an exponential growth in the receptive field without losing resolution or coverage (Yu & Koltun, 2015). Formally, DCC on element $s$ of a sequential input series $x \in \mathbb{R}^n$ with a filter $f: \{0, \ldots, k-1\} \to \mathbb{R}$ is defined as:

$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \quad (2)$$

where $*_d$ denotes the dilated convolution operator and $s - d \cdot i$ indicates the past direction. In addition, a TCN yields an output tensor of the same dimension as the input tensor; thus zero-padding is applied to every subsequent layer to keep its length equal to that of the previous layer.
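The dilated causal convolution described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `dilated_causal_conv` and `receptive_field` are hypothetical, positions before the start of the series are treated as zeros (left zero-padding), and the receptive-field formula assumes one convolution per layer with dilations 1, 2, 4, and so on.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i] for every position s.

    Indices that fall before the start of the series are treated as zeros,
    so the output has the same length as the input and never uses x[s+1:].
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            j = s - d * i          # look back d*i steps into the past
            if j >= 0:             # implicit left zero-padding
                total += f[i] * x[j]
        out.append(total)
    return out


def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked layers whose dilation doubles per layer
    (assumption: a single convolution per layer)."""
    return 1 + (kernel_size - 1) * (2 ** num_layers - 1)


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_causal_conv(series, [0.5, 0.5], d=2))  # -> [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
print(receptive_field(kernel_size=2, num_layers=4))  # -> 16
```

With a kernel size of 2, four such layers already cover 16 past steps, which is the exponential growth in receptive field that the text attributes to doubling the dilation rate.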

Transformer encoder
The textual features are captured through the TE, which is based entirely on self-attention without recurrence or convolution. The TE stacks a certain number of multi-head attention (MHA) layers, where the self-attention is implemented, and pointwise fully connected feed-forward layers (Vaswani et al., 2017). Each input embedding $X \in \mathbb{R}^{d}$ has three different vectors, namely query $Q$, key $K$ and value $V$, which are linearly projected $h$ times with learnable parameter matrices $W_q \in \mathbb{R}^{d \times d_k}$, $W_k \in \mathbb{R}^{d \times d_k}$, and $W_v \in \mathbb{R}^{d \times d_v}$:

$$Q = X W_q, \quad K = X W_k, \quad V = X W_v \quad (3)$$

The scaled dot-product attention function performed in the MHA can be described as:

$$\alpha^{(h)}_{ij} = \operatorname{softmax}_j \left( \frac{\langle Q^{(h)}(x_i), K^{(h)}(x_j) \rangle}{\sqrt{d_k}} \right) \quad (4)$$

$$\mathrm{Attn}^{(h)}(x_i) = \sum_{j} \alpha^{(h)}_{ij} V^{(h)}(x_j) \quad (5)$$

where $Q^{(h)}(x_i)$, $K^{(h)}(x_i)$, and $V^{(h)}(x_i)$ denote the vector representations of the $i$-th input token in the head $h$. $\langle Q^{(h)}(x_i), K^{(h)}(x_j) \rangle$ defines the compatibility of $x_i$ with $x_j$ in the head $h$ and is normalised to compute the attention weight. Equation (5) defines the self-attention, a weighted sum of values. Therefore, MHA can be obtained by stacking $h$ parallel layers, or attention heads, given by

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_o \quad (6)$$

where $\mathrm{head}_i = \mathrm{Attn}(Q^{(i)}, K^{(i)}, V^{(i)})$ and $W_o \in \mathbb{R}^{h d_v \times d}$. By training the query and key matrices of the attention layers, this architecture learns how useful each word in a news piece is, expressed as its attention weight, and weights the word accordingly. The attention mechanism was initially integrated into sequence-to-sequence models (i.e., RNNs) for a selective concentration on relevant information. Compared to the standard attention mechanism, self-attention creates a shorter path between any combination of distant input or output positions. Thus, it is more efficient to compute and parallelise, and it captures long-range dependencies more readily, all of which are crucial for contextual embedding (Mishev et al., 2020).
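To make the weighted-sum view of self-attention concrete, the following is a minimal single-head sketch in pure Python. It is an illustration under stated assumptions rather than the paper's code: the helper names (`softmax`, `project`, `self_attention`) are hypothetical, inputs are small nested lists instead of tensors, and masking and the multi-head concatenation are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def project(X, W):
    """Multiply each row of X (n x d) by W (d x d_out)."""
    return [[sum(x[r] * W[r][c] for r in range(len(W)))
             for c in range(len(W[0]))] for x in X]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Returns the attended outputs and the attention weights (one row per token).
    """
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        alpha = softmax(scores)                    # attention weights for this token
        weights.append(alpha)
        outputs.append([sum(a * v[c] for a, v in zip(alpha, V))
                        for c in range(len(V[0]))])  # weighted sum of values
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended, attn = self_attention(tokens, identity, identity, identity)
print(attn)
```

Each row of `attn` sums to one, so every output vector is a convex combination of the value vectors, which is what lets the encoder emphasise the most useful words.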

Cross residual learning
The proposed architecture integrates heterogeneous input sources by adding a CRL module between different data types. Initially, a residual learning framework was introduced to deep neural networks to address the degradation problem and ease the training process (He et al., 2016), given by

$$y = F(x, \{W_i\}) + x$$

where $F$ indicates the residual mapping fit by stacked convolutional layers, and $W_i$ is the weight associated with the $i$-th layer. Since $F + x$ is implemented through a shortcut connection and element-wise addition, $F$ and $x$ must have the same dimension. Otherwise, a linear projection $W_S$ is required on the shortcut connection to match the dimension.
Jou and Chang (2016) put forward an extension that enables intuitive learning across domains in the field of visual recognition, aiming at building a more versatile multimodal network that leverages cross-task dependencies. Given a target task $t$ and $N - 1$ other related tasks, a cross-residual module is defined as:

$$y^{(t)} = F(x^{(t)}, \{W_i^{(t)}\}) + x^{(t)} + \sum_{j \neq t} W_S^{(j)} x^{(j)}$$

where the superscript $(\cdot)$ denotes the task index and $j \neq t$. It comprises the residual learning for $t$ and the additive contribution of $j$ to $t$. Figure 1 illustrates a CRL module between multiple related tasks, where solid shortcuts represent identity mapping and dashed shortcuts represent cross-residual connections. It serves as a form of in-network regularisation by biasing at a layer level and enables greater network generalisation.
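One possible reading of the cross-residual module, the residual mapping of the target task plus its identity shortcut plus projected contributions from the other tasks, can be sketched as follows. This is an assumption-laden illustration, not the formulation of Jou and Chang (2016) verbatim: the names `cross_residual`, `linear`, and `add` are hypothetical, vectors are plain lists, and every cross-connection is given its own projection matrix so that dimensions match.

```python
def add(a, b):
    """Element-wise addition of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("shortcut and residual must have the same dimension")
    return [p + q for p, q in zip(a, b)]


def linear(W, x):
    """Apply a projection matrix W (out_dim x in_dim) to a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cross_residual(x_target, x_others, residual_fn, projections):
    """y_t = F(x_t) + x_t + sum_j W_j x_j over the other tasks j != t.

    residual_fn plays the role of the stacked layers F; projections holds
    one matrix per related task (an assumption made for this sketch).
    """
    y = add(residual_fn(x_target), x_target)      # ordinary residual learning
    for x_j, W_j in zip(x_others, projections):   # additive cross-task shortcuts
        y = add(y, linear(W_j, x_j))
    return y


index_features = [1.0, 2.0]
news_features = [1.0, 1.0, 1.0]
to_index_dim = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3-d news -> 2-d index space
halve = lambda v: [0.5 * a for a in v]              # stand-in for the learned mapping F
print(cross_residual(index_features, [news_features], halve, [to_index_dim]))
```

The projection on each cross shortcut is what allows channels with different feature sizes to be added element-wise.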

The proposed prediction strategy
This research develops a novel news-driven prediction system for the stock market trend. Its main novelty lies in how the different input data are fused during the training procedure to achieve better model performance. In general, the current state of the stock market can be influenced not only by its past performance but also by the latest publicly released news. That is, processing the stock index values and processing the news pieces are considered related tasks. The proposed model adopts separate channels to handle feature extraction for these related tasks. An overview of the multi-channel setting is illustrated in Figure 2. Different types of representation are processed simultaneously and eventually concatenated and passed to a fully connected layer followed by Softmax activation. Notably, a cross-residual module is highlighted with dashed lines. It helps to select textual information that is highly relevant to the stock movement and to investigate the interrelatedness between multiple sources of input data by leveraging cross-task dependencies. In this setting, an earlier integration is realised between channels, attempting to avoid information loss and thus to improve the prediction performance for stock market behaviour.

Stock index channel
When processing sequential data, a 1D feature map is produced per kernel by sliding several kernels across the series. That is, within a fixed window of length $w$, the partial sequential pattern is detected. The observed stock market value $x_{w,T} = (x_{t-w+1}, x_{t-w+2}, \ldots, x_t)$ is encoded to an index vector $V_{x_{w,T}}$ to represent the features captured in the latest time window.
The stock index channel employs four DCC blocks (as in Figure 3). Each block consists of two stacked DCC layers, followed by weight normalisation to counteract gradient explosions, SELU to introduce non-linearity, and dropout with skip connections to prevent the DCC from overfitting. Although ReLU is predominantly adopted in CNNs to realise lower error rates, Daiya et al. (2020) suggested that SELU performed better when the output needed to preserve negative values (i.e., negative stock movement). The dropout rate remains the same for DCC layers in the same residual block but differs among blocks to ensure the latent representations learned are not unduly influenced by past patterns far back in the time series. Specifically, dropout rates of 0.0, 0.1, 0.2, and 0.4 are assigned to the four blocks.
In the DCC residual block, a 1 × 1 convolution is optional and added only when the residual input and output have different dimensions. Besides dilated convolutions, residual connections can also effectively capture long-range temporal patterns (Bai et al., 2018). The output of an identity mapping, i.e., the mapping performed by skipping a certain number of layers, is added to the output of the stacked DCC layers to handle the degradation problem that arises with a deeper network structure.
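The reason SELU suits outputs that must keep negative values can be seen from a small sketch comparing it with ReLU. The constants below are the standard SELU parameters; the function names are illustrative, and this is a scalar demonstration rather than the layer used in the model.

```python
import math

# Standard SELU constants (self-normalising networks).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit: negative inputs map to negative outputs."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):+.4f}  selu={selu(value):+.4f}")
```

ReLU discards the sign of a negative activation, whereas SELU retains a bounded negative response, which is the property cited above for modelling downward movements.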

Financial news channel
On the textual input side, every single piece of news is encoded to derive the news representation. Several existing studies (Ding et al., 2015; Radinsky et al., 2012) indicated that news headlines are more informative than news contents, since covering the entire content may introduce irrelevancies that degrade the prediction accuracy. Suppose that there are in total $m$ pieces of news published on the target date $t$; the channel obtains the news vector $E(h_{t,1}, \ldots, h_{t,m}) = V_{N_t}$, where $h_{t,j}$ ($j \in [1, m]$) denotes the $j$-th news headline on that date. The news channel stacks four standard encoder blocks (as in Figure 4), using three self-attention heads for MHA and ReLU activation for the position-wise fully connected feed-forward layers with no dropout. A Mask function is adopted in the attention layer to maintain causality.
Although the full length of news content can influence the prediction results adversely, a more condensed abstract may not. In addition to the news headlines, the corresponding preview texts are encoded to examine their effect on stock forecasting. Given a paragraph set $para = \{s_1, \ldots, s_{N_S}\}$, where $N_S$ denotes the number of sentences in the preview content, the channel obtains another news vector $E(para_{t,1}, \ldots, para_{t,m}) = V^{*}_{N_t}$ whose length is fixed at 80. To avoid significant information loss from compressing an overly large input, $V_{N_t}$ is not spliced with $V^{*}_{N_t}$ when these vectors are fed into the Transformer. Therefore, two independent TE channels are adopted for news headlines and preview texts, respectively. The whole process is implemented in PyTorch 1.10.0. For model initialisation, the channel utilises word embeddings pre-trained on Google News using Word2Vec, and fine-tunes the word vectors along with the model.

Channel concatenation
The proposed model leverages an attention mechanism to investigate the interrelatedness between multiple channels by selecting textual information that is highly relevant to stock movements. A cross-residual module is highlighted in this architecture that replaces the self-residual unit in a conventional network. Compared to the setting proposed by Jou and Chang (2016), some modifications are made when the module is applied to the stock market, given by $V^{(l+1)}_{x_{w,T}} = F(V^{(l)} \ldots)$, where the superscript $l$ denotes the block (or layer) index and $F$ denotes the residual learning function. $W^{j}_{S}$ refers to the linear projection applied when the news representation has a different dimension from the index representation. The training, validation, and testing split is 70%, 15%, and 15%, respectively. The training process comprises both independent and cross-learning, which primarily deals with parameter fine-tuning, i.e., we train one channel while keeping the parameters of the other channel fixed. As a result, there are relatively fewer trainable parameters, thereby improving the learning efficiency and generalisation. The model is trained for 100 epochs with a batch size of 32, and the initial learning rate decays linearly by 0.1 every 10,000 iterations.
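The 70%/15%/15% partition mentioned above can be sketched as follows. The text does not state whether the split is chronological or shuffled; this sketch assumes a chronological split, which is a common choice for time series, and the function name `chronological_split` is hypothetical.

```python
def chronological_split(n_samples, train_pct=70, val_pct=15):
    """Return index ranges for train, validation, and test sets.

    Assumption: samples are ordered by date and are not shuffled, so the
    test set always lies after the training and validation periods.
    """
    train_end = n_samples * train_pct // 100
    val_end = train_end + n_samples * val_pct // 100
    return (range(0, train_end),
            range(train_end, val_end),
            range(val_end, n_samples))


train_idx, val_idx, test_idx = chronological_split(1003)
print(len(train_idx), len(val_idx), len(test_idx))  # -> 702 150 151
```

The count of 1003 is used only as an example input; the number of usable samples after windowing may be smaller.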
The model employs an observation window $w$ of 20 days, meaning that the index data from the past 20 trading days are required, to capture a relatively large receptive field. In addition, only the financial news published on the last day of $w$ is collected, since historical events normally influence the stock market less than the latest news. The target horizon is the next trading day. By sliding $w$ (as in Figure 5), the sample predicted each time is fed back into the proposed network to predict the next index sample.
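The windowing scheme can be sketched as below: each sample pairs the previous 20 closing values with the direction of the next trading day. This is a simplified illustration built from observed values only (it does not feed predictions back into the window), the function name `make_samples` is hypothetical, and days with no change are skipped, mirroring the exclusion of the unchanged instance in the data description.

```python
def make_samples(closes, window=20):
    """Build (history, label) pairs with a sliding window.

    history: the `window` closing values preceding the target day.
    label:   1 if the target day closes above the previous day, 0 if below.
    Days with no change are skipped.
    """
    samples = []
    for target in range(window, len(closes)):
        history = closes[target - window:target]
        change = closes[target] - closes[target - 1]
        if change == 0:
            continue
        samples.append((history, 1 if change > 0 else 0))
    return samples


prices = [100.0 + 0.5 * day for day in range(25)]   # synthetic, steadily rising series
samples = make_samples(prices)
print(len(samples), samples[0][1])
```

Sliding the window by one day per step means consecutive samples share 19 of their 20 inputs, so any train/test partition should respect time order to avoid leakage.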

Data description
The stock market index (typically computed as the weighted arithmetic mean of the prices of selected stocks) is chosen as the input variable since it is less volatile than a single stock and more indicative of the general state of the national economy. The S&P 500 (Standard & Poor's 500) is one of the most commonly used indices when predicting the stock market index. It covers 500 listed companies in the U.S. that are highly representative of the rise and fall of the national economy. The closing prices of the S&P 500 are collected from Yahoo Finance for the period between November 2015 and October 2019. There are in total $N_I = 1003$ index instances collected. Considering two output categories only (i.e., either positive or negative), the input data distribution is relatively balanced (54.14% vs. 45.86%). There is only one instance that experienced no change in value, which is excluded for simplicity.
Meanwhile, financial news items that were publicly released on trading days are also collected over this period. Both news headlines and preview texts are scraped from Reuters and CNBC (Consumer News and Business Channel), two of the most prominent multimedia agencies that provide an expansive range of real-time business and financial news. A total of 68,253 pieces of news are collected from the target websites; each piece is marked with the last updated date, headline, and description. Punctuation and stop-words are removed. To fix the length of the input texts, zero padding is applied on the right side of any headline or preview input tensor that fails to reach the fixed size of 20 or 80, respectively. This setting ensures the implementation of cross-connections through linear projection.
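The right-side zero padding described above can be sketched with a small helper. The fixed sizes of 20 and 80 come from the text; the function name `pad_right`, the use of integer token ids, and the truncation of over-long inputs are assumptions made for this illustration, since the text does not say how longer sequences are handled.

```python
HEADLINE_SIZE = 20   # fixed headline length stated in the text
PREVIEW_SIZE = 80    # fixed preview length stated in the text


def pad_right(token_ids, size, pad_id=0):
    """Pad with zeros on the right up to `size`.

    Assumption: sequences longer than `size` are truncated.
    """
    clipped = list(token_ids[:size])
    return clipped + [pad_id] * (size - len(clipped))


headline = [12, 7, 431, 9]                 # hypothetical token ids
padded = pad_right(headline, HEADLINE_SIZE)
print(len(padded), padded[:6])
```

Fixing both lengths in advance keeps the shapes of the news tensors constant, which is what makes a fixed linear projection on the cross-connections possible.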

Performance metrics
In the presented binary classification problem, the confusion matrix is utilised to assess the model quality. The confusion matrix, also called the error matrix, is a typical criterion for statistical classification in supervised learning and is not limited to binary categories. The most basic terms in a confusion table are true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP indicates that both predicted and actual values are positive, and TN indicates that both are negative. FP represents a positive prediction but a negative observation; it is also referred to as a Type I error, related to overestimation. Conversely, FN is referred to as a Type II error, related to underestimation.
The confusion matrix yields four evaluation indicators, namely accuracy, precision, recall, and F1-score. Accuracy is one of the most intuitive measures and describes the consistency between the predicted output and the actual level. Nevertheless, Accuracy may produce misleading results if the dataset is asymmetric (i.e., FP significantly differs from FN). Besides accuracy, the F1-score is also adopted. It is the harmonic mean of Precision and Recall, taking both FP and FN into account. When dealing with imbalanced datasets, the F1-score is usually more informative than Accuracy because it penalises extreme values.
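The four indicators follow directly from the confusion-matrix counts, as the short sketch below shows. The function name `classification_metrics` and the example counts are illustrative only and are not results from this study; zero denominators are mapped to 0.0 as a convention chosen for the sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, tn=30, fp=20, fn=10))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high F1-score by maximising recall at the expense of precision, or vice versa.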

Model performance
The performance of the proposed mc-CRTCN is evaluated in three progressive aspects: (i) an assessment of the basic TCN architecture, (ii) the effect of different residual learning patterns, i.e., CRL and independent learning (IL), and (iii) the effect of involving news previews in the news channel. The numerical results are primarily shown in the form of comparative analysis. In addition, beyond the binary classification problem, different output settings are applied to examine the effectiveness and robustness of the proposed mc-CRTCN.

TCN vs. baseline models
The stock index channel, which deals with the market behaviour itself, is the base of the proposed framework. Considering the 1D CNN as a natural starting point, the framework extracts such intrinsic historical features through the generic TCN architecture.
In order to demonstrate its strength in financial time series forecasting, a comparison analysis is implemented between TCN and some conventional prediction models, including ARIMA, LSTM, and CNN. Stochastic gradient descent (SGD) is employed to optimise the training process, with a kernel size of 2 and 10 residual blocks. All the settings for the input data remain the same. These baseline models are evaluated only with the stock index data, and the results are shown in Table 2. The generic ARIMA model, based on a linear assumption, shows its weakness in processing dynamic stock market data. Deep neural network models achieve better performance. In terms of accuracy, TCN greatly outperforms the other baseline models on stock trend prediction over the S&P 500 index, showing an obvious advantage as a basic component of the proposed framework. Compared to the vanilla CNN, TCN obtains superior outcomes as it benefits from stacked DCCs. Therefore, a DCC neural network can be an ideal model structure for stock market forecasting.

CRL vs. IL
In addition, since this paper primarily emphasises a different residual pattern among the processing channels, another comparison analysis is conducted between normal and cross-residual learning. The index prediction results are contrasted with and without cross-connections when using news headlines only. This helps to assess the effect of adding a CRL module between channels. The model performance is then evaluated with the inclusion of preview texts (or descriptions) of news contents. Notably, when feeding both headlines and descriptions to the proposed model, three channels are established: one TCN channel dealing with the stock market index and two TE channels extracting news representations. Table 3 summarises the prediction results. In Table 3, IL refers to the concatenation procedure without CRL, in which the stock index and news headlines are trained independently and eventually combined in a fully connected layer. Based on the confusion-matrix measurements, we draw the following findings:
1. Overall, the proposed prediction model that applies CRL using both news headlines and preview texts as the external inputs delivers the most satisfying results.
2. Using the same set of input variables, the model with CRL procedures outperforms the model with IL in Accuracy and F1-score. The measurements increase by 0.63% and 0.0015, respectively, when employing headlines solely.
3. The increments in (2) are modest mainly because the proposed model is benchmarked against a comparatively mature DCC-TE setting. In other words, this paper primarily assesses the effect of cross-connections, which indeed improves the prediction accuracy. These increases are still appreciable for stock market prediction due to the highly uncertain properties of the financial market.
4. An enhancement of model performance is observed in every evaluation criterion when feeding headlines and descriptions into TEs, compared to feeding the news titles only. An average increase of 3.39% and 0.0294 is observed in Accuracy and F1-score, respectively. Therefore, if available, the news preview text or description is a suitable input candidate when employing financial news for stock market prediction.

Illustration of news previews
A few sample news instances are tabulated below to exemplify (4) above, demonstrating how preview texts provide supplementary information relevant to the stock market. In Table 4, bold texts indicate key information in the news headlines. Bold underlined texts represent information that is parallel to the headlines. Bold italic texts represent information that is additional to the headlines. All key words are selected from a subject perspective. Among the five selected pieces of news, the preview texts of news (a) and (b) provide repetitive messages. This type of content risks lowering prediction accuracy as it may bring in more irrelevancies. On the other hand, the descriptions of news (c) to (e) provide additional information that is not revealed in the headlines, which further facilitates the prediction task. Based on this study, despite a potential risk of input redundancies, the involvement of the concentrated descriptive contents still improves the model performance.

Classification output setting
The original mc-CRTCN architecture formulates a binary classification problem for stock trend prediction, where the output is either positive or negative. In reality, investors may be more interested in the particular interval into which the next-day index value may fall. Therefore, different output settings can be applied to the classification task.
The model performance under different classification modes is assessed based on prediction accuracy, as illustrated in Figure 6. The prediction accuracy decreases as the output intervals narrow. Still, CRL with both headlines and preview texts achieves a superior result over IL. Even in a 5-class trend prediction task, it correctly forecasts the interval of the stock index value for more than half of the observation period.

Conclusion and future directions
Financial market forecasting has been one of the most discussed DL applications. The market values are intrinsically unpredictable; nevertheless, various types of publicly accessible information can help discover potential contributors to the market movements. This paper proposes a novel DL architecture, namely mc-CRTCN, that leverages online financial news articles to forecast the stock market movement for the next trading day. It is a multi-channel news-driven system established primarily on TCN (or DCC) and TE. DCC is applied to capture the long-range temporal patterns of stock market index fluctuations within a fixed time window. The index value (e.g., S&P 500) is less volatile than single stocks; thus, it is used as the input for the stock data channel. In the other channel, the news data are sent to stacked encoder blocks using the self-attention mechanism for feature extraction. We adopt two separate encoder channels when processing both news headlines and preview texts simultaneously.
The highlight of this paper lies in the method of fusing these multiple channels. Specifically, cross-connections are introduced between different channels to replace self-residual learning procedures. We empirically demonstrate the feasibility and effectiveness of multi-modal CRL with a blend of input sources. As measured by confusion-matrix metrics, our proposed architecture yields better predictive results. Furthermore, the experiment also confirms our assumption that the involvement of news content descriptions, if available, indeed improves the model accuracy.
We believe there is still considerable room for improvement in our current work regarding input data sources and the model design. Below we put forward the main future directions. More effort will be devoted to raw data collection, primarily to expand the news dataset. We plan to discover more online news sources, not only from the financial domain, that provide an abstract of contents. With a larger training dataset, we expect the proposed architecture to yield more robust outputs. Different input sources such as technical indicators can also be considered to construct other channels. The balance between model complexity and predictive performance needs to be evaluated. Neither an overly sophisticated model design nor excessive input sources can guarantee an improvement. Moderate complexity is acceptable; however, we recommend further discussion on whether the increment in accuracy is worth the higher computational cost. We propose to change the sliding window size dynamically. Too much zero padding is not desirable when linearly projecting the news input tensors. In this study, we employ the news data collected on the last day of the observation window. We can also consider, for example, using news data from the last three days to capture the lagging effect. Some modifications to the model design may be achievable. For example, some CRL units can be replaced by cross-attention, i.e., feeding the output of DCC blocks into the TE as the 'Key' for MHA.

Figure 1. A sample cross-residual module with three tasks. Source: Authors.

Figure 2. An overview of the proposed multi-channel cross-residual model architecture. Source: Authors.

Figure 5. A schematic diagram of the sliding window. Source: Authors.

Figure 6. Prediction accuracy over different output classification settings. Source: Authors.

Table 1. Recent research on stock price or index forecasting using financial news.

Table 2. Stock trend prediction with different baseline models. The highest values of Accuracy, Recall, and F1-score achieved by the baseline models are shown in bold.

Table 3. Stock trend prediction results with different residual learning patterns using different sets of news inputs.

Table 4. Sample news instances in the dataset collected from Reuters.com.
Amazon.com Inc is making a push for merchants on its website to sell goods into other countries, setting the stage for greater competition with rival marketplaces run by eBay Inc and potentially Alibaba Group Holding Ltd.
(e) Toys 'R' Us says 'making every effort' to pay vendors: Toys 'R' Us said at a bankruptcy court hearing on Tuesday that it was working hard to maximise payments to suppliers and lenders, as it starts to shutter 735 big-box toy stores across the United States.