Effective Exploitation of Macroeconomic Indicators for Stock Direction Classification Using the Multimodal Fusion Transformer

Accurate stock price prediction can have an enormous ripple effect in financial data mining. However, predicting stock prices using only stock price data is difficult because of the random nature of such data. This paper attempts to fuse data to solve the stock price prediction problem. The following data affecting the stock price are added to the proposed method as additional modalities: macroeconomic indicators and the month and day of the week. A multimodal early fusion method is used, which learns the intermodality correlation of features. The proposed model outperformed the comparison models and achieved statistically significant results. Specifically, 27 out of 50 stocks achieved higher classification accuracy than the comparison models. In addition, an in-depth analysis indicates that the early fusion strategy achieved better classification accuracy than the late fusion strategy in 30 of 50 datasets for stock price prediction.


I. INTRODUCTION
As the world economy expands, equity markets have grown, and the number of market participants has increased. Stock price prediction has become one of the most popular subjects of financial data mining [1]. Although accurate stock price prediction can help investors make the right decisions, it is difficult to achieve because of the randomness, nonlinearity, and high noise level of stock price data [2], [3]. Moreover, using only the stock price data may be unsuitable for stock price prediction because the stock market is sensitively affected by external factors, such as the world economy, domestic politics, accidental events, and even the month or day of the week [4].
Researchers use statistical and machine learning models to model the stock price. The autoregressive integrated moving average (ARIMA) model, which combines the autoregressive model, moving average model, and differencing, is one of the most representative statistical models for analyzing time-series data [5]. Traditional machine learning models, such as the support vector machine and hidden Markov model, have also been applied in this field [6], [7], [8]. Deep neural networks (DNNs) have recently demonstrated excellent prediction performance [9], [10] owing to their information fusion capability that helps capture nonlinear relationships between various financial data sources [11].
In addition to the stock price data, another information source, economic indicators, can represent the economic situation in the market [12]. Thus, by exploiting economic indicators for predicting stock prices, an improvement in accuracy can be expected. Moreover, because the stock price can be correlated with the month and day of the week, this information can further enhance the accuracy of the stock price prediction [4]. Furthermore, the internal structure of the information fusion should be carefully designed to achieve the best prediction power [13]. However, conventional studies suffer from performance degradation because of the limited number of considered information sources and the heuristic design of the information fusion structure. The information fusion strategy in DNNs can be roughly divided into two types according to the location where fusion starts in the network: early fusion, specializing in capturing the intermodality correlation, and late fusion, specializing in capturing the intramodality correlation [14]. Because these two fusion strategies have different strengths in information processing, it is still unclear which fusion strategy leads to the best prediction power for stock price prediction. This paper proposes a novel multimodal fusion transformer for stock price prediction. The contributions of this study can be summarized as follows: • A novel multimodal early fusion transformer is proposed to achieve accurate stock price prediction. In the proposed network, the early fusion strategy is used to fuse the information between modalities effectively.
• Twenty-five information sources are considered in this study from three domains: stock price, months and day of the week, and macroeconomic indicator modalities.
• An in-depth analysis is conducted regarding the fusion strategy. The analysis indicated that the early fusion strategy provides the best overall prediction performance, although a group of stocks performs better with the late fusion strategy. The experimental results on 50 stock price datasets indicated that the proposed multimodal early fusion transformer significantly outperforms conventional methods.

II. RELATED WORK
In the field of time-series prediction, the ARIMA model was frequently employed for modeling power generation, traffic flow, and sensor data [15], [16], [17]. Other researchers used the ARIMA model to predict financial time series [5]. Statistical time-series models, such as ARIMA, assume that the data are linear [18]. It is challenging to apply such assumptions throughout a financial time-series data analysis because the data have traits of nonlinearity and randomness [2], [3].
Beyond statistical stock price prediction using, for example, the ARIMA model, a series of attempts has been made to exploit machine learning models for stock price prediction. For instance, in [19], the authors used support vector regression to predict stock minute prices using technical indicators. In addition to support vector regression, the random forest was also used to build a stock price prediction model [8]. The continuous hidden Markov model, one of the most popular modeling techniques for time-series data analysis, was used in [20]. In that work, the emission probabilities are modeled as Gaussian mixture models.
DNNs can be effective alternatives to conventional econometric and statistical models, which have weaknesses in modeling financial time-series data with nonlinear traits [21]. For example, a convolutional neural network (CNN), commonly used for image data analysis, can be applied to stock price prediction [22]. In addition, long short-term memory (LSTM), bidirectional LSTM, and attention mechanisms model sequential data, such as natural language and time-series data. In detail, a bidirectional attention LSTM model for stock price prediction was proposed in [23]. A model combining the CNN and LSTM was employed to extract the features of the limit order book and stock price data: the CNN module extracts the features in the limit order book, and the LSTM module extracts the time-series features in the stock price data [24].
The transformer model was devised to improve on the attention encoder-decoder LSTM model [25], and its variants are widely used in the field of sequence modeling [26], [27]. A stock price prediction model combining a CNN and an attention bidirectional LSTM was also considered [1].
Recently, a transformer model for stock market index prediction was suggested [28]. The attention mechanism in the transformer can dynamically learn the correlation of many stocks with the market index and helps predict each stock [29]. Some authors [30] argued that conventional machine learning algorithms ignore the stochastic property of a stock that changes over time. Thus, they proposed adversarial training to learn the stochastic property of the stock price using an attention LSTM.
Recent financial studies have combined various data with stock price data [13]. Information fusion can be enhanced by conducting parallel operations that contain inter/intra crossovers and adaptive mutations in a genetic algorithm [31]. The proposed information fusion-based genetic algorithm approach can optimize the parameters of a long short-term memory stock price prediction model and select a set of features. Other authors [32] addressed the problem of technical analysis information fusion for improving stock market index-level prediction. Multiple predictive systems based on various technical analysis metric categories were devised: the system has multiple prediction models, each fed with a different technical analysis metric data category, and each prediction model is based on an ensemble of neural networks tuned using particle swarm optimization. Other authors [33] proposed multimodal neural networks whose architecture learns the cross-correlation of the United States and Korean stock prices.
With recent progress in natural language processing techniques, text data, such as news and Twitter data, were used to feed the prediction model the overall atmosphere of the market, which individual stock data cannot deliver [34]. An event-driven approach was used with stock price data [35]. The event-driven method was elaborated by extracting stock-related events from news titles. Moreover, a method enhancing the joint effects was also devised by calculating the similarity of two stocks using their p-change values and the Pearson correlation coefficient. Historical stock quantitative data, social media data, and web news data could be used by expressing the information as matrices and tensors for stock market prediction [36]. The stock quantitative feature matrix and stock correlation matrix were created using the quantitative and social media data for the stock. The stock movement tensor was built using event and sentiment extraction from news articles and social media. The stock quantitative feature matrix and stock correlation matrix were factorized, and the stock movement tensor was decomposed for stock price prediction. Bag-of-words and named entity approaches using a large corpus of freely available financial reports were used to predict the volatility of stock returns and for stock market prediction using support vector regression [37], [38]. However, these approaches suffer from several issues: the complex nature of natural language processing techniques, the subjectivity in news sentences or social media, and legal problems such as copyright issues. As a result, the empirical results cannot be reproduced, making the application of those methods in practice extremely difficult. In this paper, we consider a simpler alternative, the macroeconomic indicators, to feed the tendency of the overall market.

III. MATERIALS AND METHOD
A. MOTIVATION
First, we discuss the data used in this study. Macroeconomic indicators are representative factors affecting the stock market [40]. Some fundamental macroeconomic indicators, such as the exchange rate, interest rate, industrial production, and inflation, are associated with the stock price [39]. In addition to the macroeconomic indicators, the effect of the month and day of the week can also be considered to obtain information affecting the stock market [4]. However, stock prices may only partially reflect the macroeconomic indicator information and the effects of the month and day of the week. Therefore, the model must receive the macroeconomic indicators and the month and day of the week as inputs to reflect this information directly. Thus, data fusion can be a valid method for accurate stock price prediction because stock markets are affected by many factors [41] and randomness [2].
Multimodal neural networks that fuse various information sources internally can be considered to address the randomness problem in the data. The early fusion method extracts the intermodality correlation more efficiently than the intramodality correlation [33]. In this paper, a model that learns the correlations between the significant data affecting the stock price is designed to mitigate the randomness of stock price data. The transformer encoder is advantageous for extracting information from sequential data [25]. In the transformer encoder, the query, key, and value are derived from the same input, and the scaled dot-product self-attention learns the correlations among the query, key, and value. The multimodal early fusion transformer method is used in this paper to exploit this characteristic of the transformer encoder, which learns the correlations between the model input features.
In this study, the problem of stock direction classification is considered, which predicts whether stock prices will rise or fall in the future; this guides investors to buy a stock before the price rises or sell it before its value declines [42], [43]. Specifically, the proposed method predicts the next day's up or down direction using the previous ten days of data as a binary classification task. Thus, the stock direction classification in this study can be formulated as a binary classification problem based on about two weeks of stock market history, where the performance can be evaluated using the conventional accuracy metric. As a result, an early fusion neural network is designed with feature concatenation, a transformer encoder, and a classifier layer. Fig. 1 presents the overview of the proposed method.
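As a minimal illustration of this formulation, the following sketch builds ten-day input windows and next-day up/down labels from a series of daily closing prices. The array of prices and the helper name are hypothetical; only the window length and labeling rule are taken from the description above.

```python
import numpy as np

def make_windows(close, window=10):
    """Build sliding ten-day windows and next-day direction labels.

    close: 1-D array of daily closing prices.
    Returns X of shape (n, window) and binary labels y, where
    y[i] = 1 if the price rises on the day after window i ends.
    """
    X, y = [], []
    for t in range(len(close) - window):
        X.append(close[t:t + window])
        # Label: does the next day's close exceed the last day in the window?
        y.append(1 if close[t + window] > close[t + window - 1] else 0)
    return np.array(X), np.array(y)

prices = np.array([10., 11., 10.5, 11.2, 11.1, 11.4,
                   11.3, 11.8, 12.0, 11.9, 12.1, 12.0])
X, y = make_windows(prices)
print(X.shape)  # (2, 10)
print(y)        # [1 0]
```

Each row of X is one classification instance, and y is the binary target used with the accuracy metric described above.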

B. PROPOSED METHOD
The entire experiment was conducted with three modalities: the stock price modality, the months and day-of-the-week modality, and the macroeconomic indicator modality. Table 1 lists the details of the stock price modality. The KDD17 public dataset was used to add to the reliability of this experiment [44]. The features of the KDD17 dataset are open, close, high, low, volume, and adjusted close. Moreover, ten technical indicators were added, and two additional features in the stock price modality were extracted through feature engineering to extract significant features from the stock price data [45]. Table 2 details the months and day-of-the-week modality. In this modality, the month and day-of-the-week features were extracted using one-hot encoding because the month is a categorical variable with 12 classes, and the day of the week (over the trading days) is a categorical variable with five classes. The extracted one-hot encoding vector can be expressed as [0, 1, · · · , 0]. Last, in the macroeconomic indicator modality described in Table 3, six features were selected: the NASDAQ 100; the US 2-, 10-, and 30-year bond yields; the US Dollar Index; and the WTI oil price. Only the close column of the macroeconomic indicators was used, even though the macroeconomic indicators have several columns, such as open, high, low, close, and volume. Before conducting the training and testing processes, min-max normalization was applied to each ten-day sequence because the scale of financial data varies over long periods. Fig. 2 depicts the detailed structure of the former half of the proposed method, including the feature concatenation and multihead attention module. The modalities were concatenated along the feature dimension to fuse the data using the early fusion method. We let F1, F2, and F3 be the feature sizes of the stock price modality, the months and day-of-the-week modality, and the macroeconomic indicator modality, respectively.
Then, x1 ∈ R^(l×F1), x2 ∈ R^(l×F2), and x3 ∈ R^(l×F3) are the stock price modality, months and day-of-the-week modality, and macroeconomic indicator modality, respectively, where l is the sequence length. These multiple information sources are concatenated along the feature dimension to create an input modality matrix M ∈ R^(l×(F1+F2+F3)):

M = [x1, x2, x3].    (1)

In this paper, the months and day-of-the-week modality is represented using a one-hot encoding scheme, resulting in twelve month features and five day-of-the-week features. Thus, F1, F2, and F3 were set to 17, 17, and 6, respectively, matching the number of features in each modality. Because the day that affects the next day's stock price prediction can differ, a positional encoding module was used to provide the model with the position information of the data. The positional encoding uses sine and cosine functions of different frequencies [25]:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),

where pos denotes the position, i represents the dimension, and d_model denotes the feature dimension of the concatenated modality matrix M, to which the positional encoding is added. Scaled dot-product attention operates on a query, a key of dimension d_k, and a value of dimension d_v [46]. The query, key, and value were obtained by applying separate linear layers of dimension d_v to the output of the positional encoding. The query and key are multiplied using the dot-product operation, the result is divided by √d_k and passed through a softmax, and the output is multiplied by the value:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.

Scaled dot-product self-attention represents the data parts and their relevance to each other; Fig. 3 illustrates its overview. Using only a single attention mechanism can make it challenging to learn the various features in the three modalities.
Multihead attention can jointly address different representation subspaces that express a variety of features in the data. It expresses the diverse subspace representations in the modality:

MultiHead(Q, K, V) = Concat(head_1, . . . , head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

Next, after the input data were passed through the feature concatenation and multihead attention, feature-compression feedforward networks extract the information between the features produced by the multihead attention. Fig. 4 depicts the latter half of the proposed method. The point-wise feedforward networks have a rectified linear unit (ReLU) activation function that extracts nonlinear features, where W1, b1, W2, and b2 are the weights and biases of the point-wise feedforward networks. Each point-wise feedforward network was applied to each position separately and identically to the output of the multihead attention:

FFN(x) = max(0, xW1 + b1)W2 + b2.

After the feature-compression feedforward networks, the feature dimension is 1. In addition, time-sequence-compression feedforward networks extract the information along the time-sequence dimension. We let x be the output of the point-wise feedforward networks. Moreover, Wf ∈ R^(d_k) and bf ∈ R^(d_k) are the weight and bias values, respectively, of the feature-compression feedforward networks, and Ws ∈ R^l and bs ∈ R^1 are the weight and bias of the sequence-compression feedforward networks:

FFN_f(x) = max(0, xWf + bf) and FFN_s(x) = max(0, xWs + bs).
The output of the sequence-compression feedforward networks enters the sigmoid function, a nonlinear function that maps the input to the range [0, 1].
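The pipeline described above can be sketched end to end in PyTorch. This is a simplified illustration, not the authors' implementation: the hidden sizes, the use of `nn.MultiheadAttention`, and the omission of the final ReLU before the sigmoid are assumptions; only the modality sizes (17, 17, 6), the ten-day sequence length, and the overall concatenate-encode-compress structure come from the text.

```python
import math
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Sketch: concatenate three modalities along the feature axis,
    add sinusoidal positional encoding, apply multihead self-attention,
    then compress the feature and time dimensions down to one logit."""

    def __init__(self, d_model=40, seq_len=10, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Point-wise FFN compressing the feature dimension to 1.
        self.feat_compress = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        # FFN compressing the time-sequence dimension to 1.
        self.seq_compress = nn.Linear(seq_len, 1)
        # Fixed sinusoidal positional encoding.
        pe = torch.zeros(seq_len, d_model)
        pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x_price, x_date, x_macro):
        m = torch.cat([x_price, x_date, x_macro], dim=-1) + self.pe  # early fusion
        h, _ = self.attn(m, m, m)                # (batch, seq, d_model)
        h = self.feat_compress(h).squeeze(-1)    # (batch, seq)
        return torch.sigmoid(self.seq_compress(h))  # (batch, 1), P(up)

model = EarlyFusionClassifier()
p = model(torch.randn(4, 10, 17), torch.randn(4, 10, 17), torch.randn(4, 10, 6))
print(p.shape)  # torch.Size([4, 1])
```

The concatenated width 17 + 17 + 6 = 40 divides evenly by the eight attention heads, which is required by the multihead attention module.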

TABLE 4. Comparison of the classification accuracy (%) between the proposed method and comparison models.
The output value of the sigmoid function was used as the input to the binary cross-entropy loss together with the target value. Thus, the trainable parameters were updated based on the binary cross-entropy loss.
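The binary cross-entropy objective can be verified with a short stdlib-only sketch; the prediction and target values below are hypothetical, chosen only to exercise the formula.

```python
import math

def bce(preds, targets):
    """Binary cross-entropy: -(1/n) * sum(y*log(h) + (1-y)*log(1-h))."""
    n = len(preds)
    return -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                for h, y in zip(preds, targets)) / n

# Hypothetical sigmoid outputs and 0/1 next-day direction targets.
preds = [0.9, 0.2, 0.7, 0.4]
targets = [1, 0, 1, 0]
print(round(bce(preds, targets), 4))  # 0.299
```

Confident predictions on the correct side (0.9 for class 1, 0.2 for class 0) contribute small terms, while uncertain ones (0.4, 0.7) dominate the loss.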

IV. EXPERIMENTAL RESULTS
A. EXPERIMENTAL SETTINGS
The KDD17 datasets [44] and macroeconomic indicators were used for stock direction classification. The macroeconomic indicators were collected from Investing.com. The period of the historical prices in the KDD17 datasets and macroeconomic indicators was 2,518 days, from January 1, 2007, to January 1, 2016. The KDD17 dataset has 50 stocks in US markets. This paper's macroeconomic indicators include the NASDAQ 100; the US 2-, 10-, and 30-year bond yields; the US Dollar Index; and the WTI oil price.
A blocked time-series cross-validation was used as the data split strategy to respect the traits of the time-series data [47]. Data leakage can occur if a random split strategy is used because a random split of a time series can let the model predict the next day by examining a future value. In addition, a ten-fold split was used over the entire data length, setting the training, validation, and testing data ratio to 8:1:1. Fig. 5 presents the overview of the blocked time-series cross-validation.
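The split can be sketched as follows. The exact block boundaries (contiguous equal-sized blocks, discarding the few leftover days) are an assumption; only the ten folds and the 8:1:1 chronological ratio come from the description above.

```python
def blocked_splits(n, n_folds=10, ratios=(8, 1, 1)):
    """Split n samples into n_folds contiguous blocks; inside each block,
    keep chronological order and apply a train/val/test ratio (8:1:1)."""
    folds = []
    block = n // n_folds  # leftover samples at the end are discarded
    for f in range(n_folds):
        start, end = f * block, (f + 1) * block
        size = end - start
        n_train = size * ratios[0] // sum(ratios)
        n_val = size * ratios[1] // sum(ratios)
        train = range(start, start + n_train)
        val = range(start + n_train, start + n_train + n_val)
        test = range(start + n_train + n_val, end)
        folds.append((train, val, test))
    return folds

folds = blocked_splits(2518)  # 2,518 trading days in the study period
train, val, test = folds[0]
print(len(train), len(val), len(test))  # 200 25 26
```

Within every block, training indices strictly precede validation indices, which strictly precede test indices, so the model never sees a future value during training.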
The proposed model was compared to three comparison models in terms of accuracy. The parameters of the three comparison models were set to the values recommended in their respective papers. A brief review of the three comparison models follows. • Adversarial LSTM [30]: The adversarial training method is a module that captures the stochastic property in stock data. Adversarial training had previously been used for computer vision tasks at the data level, whereas this method adds feature-level perturbations during training to express the inherent stochastic properties of stock price data for stock price prediction.
• DTML [29]: The DTML model combines the attention LSTM and transformer models. The authors used the attention LSTM model to extract the sequential features in the stock data and concatenated the model outputs. Furthermore, the transformer model is used to learn the correlation between the multiple stocks from the concatenated attention LSTM features.
In this paper, the hyperparameters were set as follows. The batch size was set to 32, and the learning rate to 0.001. Training ran for 100 epochs, and the model with the lowest validation loss across epochs was used for testing. The Adam optimizer was used for training [48]. The number of transformer layers was one, the transformer model dimension was 320, and the number of transformer heads was set to eight. The loss function was the binary cross-entropy loss, defined as follows:

Loss = −(1/n) Σ_{i=1}^{n} [y_i log h(x_i) + (1 − y_i) log(1 − h(x_i))],

where n, y_i, and h(x_i) are the number of data points, the 0/1 binary target value, and the final output of the model through the sigmoid function, respectively. In addition, accuracy is employed as the evaluation measure:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives. For each dataset, ten accuracy values can be obtained, each representing the accuracy of the corresponding fold among the ten splits. These values are averaged to represent the performance of the corresponding method. Based on the average accuracy over the ten splits, the superiority among the compared methods for each dataset can be determined and represented as a rank value. Lastly, the average rank can be obtained by averaging the rank values of each method over all the datasets. Statistical tests were performed to validate the superiority of the proposed method for stock direction classification. The Friedman test is a statistical test for multiple nonparametric comparisons of methods over multiple datasets [49].
Given k methods and N datasets, r_i^j denotes the rank of the jth method on the ith dataset (mean ranks are used in case of ties). The Friedman test compares the average ranks of the methods, R_j = (1/N) Σ_i r_i^j. The null hypothesis states that all the methods are equivalent, so their average ranks R_j should be equal. The Friedman statistic is

χ²_F = (12N / (k(k + 1))) [Σ_j R_j² − k(k + 1)²/4],

and the derived statistic

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F)

is, under the null hypothesis, distributed according to the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom [50]. The null hypothesis of the multiple comparison is rejected when the F_F statistic is larger than the critical value under significance level α or, equivalently, when the corresponding p-value is smaller than α. In this case, a post-hoc test can proceed to make pairwise comparisons between the methods. The Bonferroni-Dunn test was selected as the post-hoc test because it is usually recommended after rejecting the Friedman test's null hypothesis [52]. The performance of two methods is significantly different if their average ranks differ by at least the critical difference (CD):

CD = q_{α/(k−1)} √(k(k + 1) / (6N)),

where q_{α/(k−1)} is the upper α/(k − 1) percentile point of the studentized range distribution with (k, ∞) degrees of freedom [51]. Table 4 reveals the experimental results for the entire dataset in terms of accuracy. The table consists of the ticker, sector, and experimental results for the proposed and three comparison models. The ticker represents the name of a specific stock; the sector of a stock is a unit consisting of similar industries. The experimental results are expressed as the average accuracy and its standard deviation. Table 5 presents the abbreviation of each sector. With the proposed method, 27 out of 50 stocks achieved the best accuracy compared to the comparison models.
With the proposed method, the stock that yielded the best performance was Verizon Communications (VZ), with an accuracy difference of 8.85 percentage points over the comparison models, which can be regarded as a significant difference in stock direction classification. The Friedman test with a 5% significance level was used to verify whether differences exist between groups. As shown in Table 6, the Friedman test statistic was 24.084, and the p-value was 2.3981e-05. Thus, there were statistically significant differences between groups under a significance level of 0.05.
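The rank-based comparison above can be reproduced in miniature with SciPy; the per-dataset accuracies below are synthetic (the first, "proposed" column receives a small systematic boost for illustration), and SciPy reports the chi-square form of the Friedman statistic rather than the derived F_F form.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
# Hypothetical accuracies for four methods on 50 datasets.
base = rng.uniform(0.45, 0.55, size=(50, 3))
proposed = base.mean(axis=1, keepdims=True) + 0.03
acc = np.hstack([proposed, base])

# Friedman test over the four methods (k = 4, N = 50).
stat, p = friedmanchisquare(*acc.T)

# Average rank per method (rank 1 = highest accuracy on a dataset).
avg_rank = rankdata(-acc, axis=1).mean(axis=0)
print(p < 0.05, avg_rank.argmin())  # significant, "proposed" ranks best
```

A p-value below the significance level licenses the post-hoc pairwise comparison against the critical difference, as described above.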

B. COMPARISON RESULTS
As a result, the Bonferroni-Dunn test was applied with a 5% significance level to verify which groups differ.

TABLE 9. Comparison of the classification accuracy between the early fusion method (proposed) and the late fusion method.

The CD value was 0.6181 under a significance level of 0.05 in the Bonferroni-Dunn test. The differences in average rank between the proposed and comparison models were greater than the CD value. Fig. 6 presents the results of the Bonferroni-Dunn test; in the figure, the three comparison models lie farther from the proposed model than the CD value of 0.6181. The proposed method achieved the best average rank and outperformed all comparison methods. The quantitative analysis confirmed that the proposed method has a numerical advantage over the comparison models, and the statistical tests indicate that this difference is statistically significant. Additional comparison experiments were conducted on transformer variant models to determine whether the result of the proposed model is an effect of the transformer or of the modality fusion. Table 7 presents the experimental results for the proposed and transformer variant models in terms of accuracy. We included DTML [29] in this experiment again because it is based on the transformer architecture. The proposed method gained the best average rank of 1.96 compared to the transformer variant models. The Friedman and Bonferroni-Dunn tests were performed with a significance level of 0.05 to determine whether a statistical difference exists. Table 8 lists the results of the Friedman test, and Fig. 7 displays the results of the Bonferroni-Dunn test. The Friedman test statistic was 18.595, and the p-value was 0.00033, indicating statistically significant differences between the proposed and comparison models. The CD value was 0.6181 with a significance level of 0.05 in the Bonferroni-Dunn test. When comparing the differences in average rank between the proposed and comparison models, only the mean rank difference between the proposed model and the Informer model was smaller than the CD.
The statistical tests thus found that the proposed model had a statistically significant difference from all comparison models except the Informer model.

C. IN-DEPTH ANALYSIS
In this study, we devised a multimodal transformer based on the early fusion strategy. To verify the effectiveness of this choice, the early and late fusion models were compared in terms of classification accuracy. Table 9 shows the comparison results. The proposed early fusion model achieved better classification accuracy in 30 out of 50 stocks compared to the late fusion model. The early fusion method achieved the best performance in only 3 out of 6 stocks in the financial sector, but in 6 out of 8 stocks in the energy sector. Fig. 8 presents the sectors where the early fusion method performs better than the late fusion method, the proportion of stocks involved, and how well the stocks in each sector perform. Fig. 9 lists the sectors where the late fusion method performs better than the early fusion method, with the same information. Fig. 10 displays the overall per-sector comparison between the early and late fusion methods. The early fusion method achieved higher classification accuracy than late fusion in five sectors: materials, communication services, energy, information technology, and consumer cyclical. The late fusion method achieved higher classification accuracy than the early fusion method in three sectors: utilities, consumer defensive, and healthcare. The sectors with the same ratio of stocks with high classification accuracy under early and late fusion were the industrial and financial sectors.
The attention map was also analyzed for scaled dot-product attention at the inference level. The attention maps in Figs. 11 and 12 are for Apple Inc., which achieved high classification accuracy compared to the comparison models. If the color in the attention map is white, the corresponding features are highly correlated; if it is black, they are weakly correlated. Fig. 11 depicts the attention map for early fusion, and Fig. 12 depicts the attention map for late fusion. Part ''a'' in Fig. 11 indicates that the 11 features in the stock price modality are highly correlated with the months and day-of-the-week modality and the macroeconomic indicator modality. The attention map for early fusion indicates that the modalities (1), (2), and (3) are correlated because the value of each section in the attention map is highly activated. Accordingly, in the early fusion method, each modality learns by referring to the others. In late fusion, by contrast, the attention map is activated only within each modality, because the features are separated and passed through the scaled dot-product attention independently. Consequently, the early fusion method learns the intermodality correlation better than the late fusion method because all the modalities are processed through the scaled dot-product attention together, which learns the relationships between the data.

V. CONCLUSION
The task of stock price prediction can be challenging if the algorithm depends only on the stock data because the stock price can be affected by external factors, such as the world economy and policy. In this study, in addition to the stock data, two modalities, macroeconomic indicators and the month and day of the week, provide a global view of the stock market and seasonal/weekly information, respectively. A multimodal fusion architecture that learns the intermodality correlation of the features was designed for the proposed method.
The proposed model outperformed the comparison models, and the statistical tests confirmed that the difference was statistically significant. An in-depth analysis comparing the early and late fusion methods was also performed. The analysis indicates that the early fusion method achieved higher classification accuracy in 30 of 50 datasets, and that each strategy demonstrates different strengths for given sectors.
Potential directions of future work include additional exploratory data research in financial data mining. Moreover, a data fusion methodology that considers the traits of various data types is also crucial in multimodal stock direction classification.