1 Introduction

This paper investigates the transfer of deep learning models between two regionally distinct financial markets. The aim is to determine whether such financial models can replicate their performance on regionally distinct, though comparable, markets. Specifically, this study examines the problem using a range of deep learning techniques, including fully connected, convolutional, and recurrent networks. Evaluating the performance of neural network architectures across regional markets is of interest because of the different microstructure factors present within each market; these factors affect the ability of models to generalise across markets.

Motivation for such research is also driven by the ongoing need of professionals to use the latest tools when investing in financial markets. As a result of the industry-wide push to identify investment opportunities more accurately, there is a continuing drive towards the adoption of predictive algorithms such as neural networks, which have been shown to obtain significant results in several fields, including computer vision, natural language processing, health analytics, engineering, and game-playing [1,2,3,4,5,6,7,8,9]. The attraction of such investment pursuits has driven the value of the world’s stock markets to the point where they now comprise a substantial proportion of global wealth; a recent estimate places the value of the global equities market at more than USD $110 trillion [10].

The past two decades have also seen enormous growth in the predictive power of neural networks, as well as the development of several new neural network classes such as generative adversarial networks and transformers [11, 12]. A concomitant increase in computational power has fuelled the growth and widespread adoption of deep neural networks. This increasing use of neural networks is predominantly due to their ability to act as automated pattern recognition machines that can be trained on real-world data without an explicit theoretical basis for the complete inner workings of these models. As pattern recognition machines, neural networks have previously been used to identify patterns in the financial markets [13,14,15,16,17,18,19,20,21,22,23,24].

Two of the most widely used neural network classes in the financial modelling literature are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Convolutional neural networks are typically used for computer vision tasks, such as image recognition or video classification. However, owing to their ability to extract increasingly abstract and generalised features, they have achieved substantial successes in a broad range of fields, particularly in finance [25]. A common input to convolutional neural networks in finance is raw numerical data. As an example, Gudelek et al. [26] use a 2-D CNN to predict the following day’s stock price for exchange-traded funds (ETFs). They use trend and momentum indicators as inputs to their CNN and are able to obtain a significant and positive return from their backtesting procedure. Hoseinzade and Haratizadeh [27] extend the finance and deep learning literature through the use of a 3-D CNN, taking a set of 82 variables including economic factors, technical indicators and index-specific factors as inputs. Beyond the standard application of convolutional neural networks to raw numerical data, there is a small but growing body of literature that utilises visual financial inputs as an alternative style of input data. From the small number of studies, it appears that the use of visual inputs (e.g. price or candlestick charts) is able to produce significant results [28,29,30]. In addition, research is ongoing to improve our ability to understand which features of the input data (or features extracted within hidden layers) have the greatest effect on the learning and prediction process. Traditional methods such as Hinton diagrams [31, 32] are still in use in the financial literature [17], but there is an increasing body of work that seeks to improve upon the traditional approaches to gain insight into what are often deemed to be black boxes [33, 34].

Recurrent neural networks make allowance for temporal factors and are therefore often used for modelling time-series data, particularly in finance [17, 25]. There are many varieties of recurrent neural networks; however, the most commonly used are the long short-term memory (LSTM) [35] and the gated recurrent unit (GRU) [36]. One particular study of note [17] utilised recurrent neural networks (as well as a suite of deep and machine learning models) to predict the next-day return for stocks in the S&P 500 index, achieving significant positive returns using relatively small RNNs. Nelson et al. [37] obtained similarly promising results, but they also provide a suite of technical indicators as well as price history as additional inputs to the LSTM. Matsumoto and Makimoto [38] investigated the performance of various machine and deep learning models in equity investing and found that LSTM models outperformed a range of other candidate models on the S&P 500. Similar results were also obtained by Fischer and Krauss [19]. It is also becoming more common for a variety of neural networks to be combined to create a hybrid network, such as the LSTM-CNN employed by Kim and Kim [39]. A similar study by Liu et al. [40] also used an LSTM-CNN to perform strategy analysis, as well as to improve stock selection and timing, and found that their hybrid neural network was able to outperform two benchmarks: the respective index and the classic momentum strategy.

Overall, as demonstrated above, there are a variety of ways in which financial data can be input to machine learning models. Candlesticks are one such representation of the standard stock time-series [41]. It has been argued for some time that both finance academics and practitioners would benefit from a better understanding of the predictive information contained in candlestick charts (see, e.g. [42, 43]). To this end, research has begun to examine the ability of deep neural networks to extract such patterns [44, 45]. Indeed, a systematic review of the contemporary literature investigating the development and application of machine learning to the equities markets has recently been completed by [46], documenting the machine learning categories that dominate the literature. They draw three main conclusions. The first is that there needs to be an increased emphasis on the generalisability of results from machine learning studies; accordingly, they suggest that models and approaches should be evaluated across several distinct markets in future research. The second is that the use of machine learning techniques for financial modelling (regardless of whether the model is a black box) needs to give due consideration to financial theory in terms of the inputs to the model, the algorithms utilised, and the subsequent performance analysis. They also conclude that artificial neural networks are best suited to regression-style problems in this area, while support vector machines are better suited to classification tasks.

In keeping with the direction of the literature, the work of Ghoshal and Roberts [17], published in an earlier volume of this journal, developed and compared several networks trained on 22 years of US equity data. They found that optimised neural networks outperform standard technical and other shallow learning methods. By examining the weight-space visualisation (Hinton diagrams) of their CNN, they provide a visual interpretation of what the network has recognised as significant candlestick sequences. Their best validated model is statistically significantly better than random choice at predicting the direction of the next day’s returns. However, and in keeping with the first conclusion of Strader et al. [46], they did not attempt to develop or apply their model to other markets. Such extensions are commonplace in the financial literature: works such as [47, 48] investigate the efficacy of methodologies developed in one market when applied to other distinct regional markets. As such, the motivation of the current work is to extend the work of Ghoshal and Roberts [17] to the Australian equities market, thereby addressing the key research gap that their approach has not yet been applied to other markets or over various market cycles.

The contributions of the current paper are twofold. First, it provides a continuation study of Ghoshal and Roberts’ [17] work on Australian data, as supported by the call for future research by Strader et al. [46]; we also address their comment on the need to assess results over various market cycles. Second, the universal workflow of machine learning [25] is used to independently develop models that best fit the Australian data. Together, these contributions address the benchmarking and application of neural network models developed on one market to similar, but geographically separated, markets. As such, the key objective of this work is to determine the performance differential of deep learning architectures between regional markets. An additional objective is to begin to address the lack of generalisability of results that has been identified as a key issue in the financial literature [46].

The remainder of this work proceeds as follows. The methodology is presented in Sect. 2, with emphasis on the data, the deep learning techniques and the training methodology. The results are then discussed in Sect. 3, where it is shown that the findings are statistically significant. Finally, the work concludes in Sect. 4, where future research directions are discussed.

2 Methodology overview

This section provides an overview of the methodological details of this study. Neural networks have dominated in popularity over the past decade as the go-to modelling methodology for pattern recognition applications [25]. Applying these pattern recognition models to financial data is a natural step, and their use within finance research has grown substantially over the past decade [46]. The deep neural network architectures used in this paper necessarily follow Ghoshal and Roberts [17], given that this is a comparative study. The data used herein, and the methods used to ensure a valid comparison with the previous work, are described below. Details specific to the construction of balanced datasets are provided to ensure conformance with, and reproducibility of, the methodology adopted in Ghoshal and Roberts [17].

2.1 Candlestick data

In keeping with this work being a continuation study, candlestick data from the Australian stock market were collected for the training, validation, and test sets. Ghoshal and Roberts [17] justify the use of candlestick data by noting that it is widely believed by technical analysts to be a leading indicator of future price movements. The raw data were collected for each company from the publicly available Yahoo Finance website and included the daily Open, High, Low, Close, Adjusted Close price and Volume data. In their study, Ghoshal and Roberts [17] selected the US S&P 500 as their market. For consistency, comparable stocks from the Australian ASX50 were collected. This approach ensured that, just as with the S&P 500, these 50 Australian stocks have a significant influence on the local market. All data were adjusted in the usual manner on a per-day basis over the entire period by applying the ratio of the adjusted close to the close price on each day to that day’s candlestick values.
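For concreteness, this per-day adjustment can be expressed in a few lines of pandas. The following is a minimal sketch only, assuming the raw Yahoo Finance column names ('Open', 'High', 'Low', 'Close', 'Adj Close'); the function name and data layout are illustrative rather than the study's actual implementation.

```python
import pandas as pd

def adjust_candlesticks(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the daily Adjusted Close / Close ratio to each day's OHLC values.

    `df` is assumed to hold the raw Yahoo Finance columns
    'Open', 'High', 'Low', 'Close', 'Adj Close' and 'Volume'.
    """
    ratio = df["Adj Close"] / df["Close"]
    adjusted = df.copy()
    for col in ["Open", "High", "Low", "Close"]:
        # scale each candlestick value by that day's adjustment ratio
        adjusted[col] = df[col] * ratio
    return adjusted
```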

For consistency with the original research, this study involved a binary classification of the dependent variable: either a next-day upward move or a next-day downward move in the closing price. Table 1 shows the standard data split, broken down by class. Great care was taken to ensure that the stocks selected would have sufficient liquidity to allow realistic results to be obtained. As the Australian market is significantly less liquid than the US market, the set of stocks with sufficient liquidity is much more limited: Ghoshal and Roberts [17] were able to use a selection of 500 US stocks, whereas this study was limited to the top 50 Australian stocks. The judicious manner in which the Australian stocks were selected means this restriction did not compromise the continuation of the original study, as the two sets of equities have comparable liquidity. In addition, the candidate model architectures were selected with care to ensure there were sufficient training vectors for the networks to be appropriately trained. To determine the final set of model hyperparameters, the standard training and validation procedure was applied with the optimisation goal of maximising accuracy. The number of trainable parameters was also considered: where two models attained similar performance but one had substantially fewer trainable parameters, the smaller model was selected. The entire training and validation process was conducted in line with standard practice to avoid biasing the final out-of-sample results.

Table 1 Summary of final data

2.2 Sequencing and balancing the datasets

As is well known for binary classification tasks, and as commented upon in Ghoshal and Roberts [17], convergence of the model parameters during training depends crucially on having balanced input datasets; this is a well-documented issue in the deep learning literature. As can be seen in Table 1, the issue is addressed here in the standard manner by ensuring the training data are equally balanced between the two values of the daily return, assessed close-to-close. When balancing the training data, there are three outcomes in the raw data (assuming the stock continues to trade): the price either increases, remains the same, or decreases. The approach Ghoshal and Roberts [17] adopted was to add noise drawn from a Gaussian distribution (mean zero, standard deviation 0.001) to jitter zero-return days into either the ‘UP’ or ‘DOWN’ class. This satisfactorily reduces the classification task to a binary one for the purposes of training the model. Furthermore, to ensure the entire training dataset is balanced, a threshold is calculated that places 50% of the jittered returns above that value and 50% below. To reduce bias, this dataset-specific threshold was calculated using only the training set and then applied to the training, validation and testing datasets.
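A minimal sketch of this jittering and thresholding procedure is given below, assuming close-to-close returns held in NumPy arrays; the variable names, random generator and seed are illustrative and need not match the exact implementation of [17].

```python
import numpy as np

def jitter_and_label(train_returns, other_returns, sigma=0.001, seed=0):
    """Jitter close-to-close returns with zero-mean Gaussian noise (std 0.001),
    then pick the threshold that splits the *training* returns 50/50.

    The same training-set threshold is reused to label the validation and test
    returns, so no information leaks from those splits.
    """
    rng = np.random.default_rng(seed)
    train_j = train_returns + rng.normal(0.0, sigma, size=len(train_returns))
    other_j = other_returns + rng.normal(0.0, sigma, size=len(other_returns))

    threshold = np.median(train_j)  # 50% above, 50% below on the training set
    train_labels = (train_j > threshold).astype(int)  # 1 = 'UP', 0 = 'DOWN'
    other_labels = (other_j > threshold).astype(int)
    return train_labels, other_labels, threshold
```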

In keeping with the approach taken by Ghoshal and Roberts [17], the input data were then batched into tensors of shape (20, 4). Each tensor provides a 20-day historical window consisting of the Open, High, Low and Close prices of each day. Successive windows overlap in the temporally ordered data, producing one input training vector per trading day. It is argued [17] that the rationale for including this historical window is that it provides a context within which the candlestick pattern of each day can be examined by the neural network. To ensure a fair comparison in the current continuation study, this historical window was kept at 20 days for the Australian data as well.

Finally, each (20, 4) tensor (generated following Ghoshal and Roberts [17]) was individually normalised so that the visual appearance of the normalised candlestick window is identical to that of the unscaled window. The (20, 4) input tensors were then stacked into the training, validation or testing dataset based on date, consistent with the approach of Ghoshal and Roberts [17]. The resultant tensors had dimensionality (# samples in data split, 20, 4).
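The windowing and normalisation steps can be sketched as follows. The per-window min–max scaling shown is one plausible scheme that preserves the relative geometry (and hence visual appearance) of the candlesticks within a window; the exact normalisation used in [17] and in this study may differ, so this should be read as an assumption.

```python
import numpy as np

def make_windows(ohlc: np.ndarray, window: int = 20) -> np.ndarray:
    """Stack overlapping (window, 4) tensors of Open/High/Low/Close values.

    `ohlc` has shape (n_days, 4) and is assumed to be ordered by date.
    Returns an array of shape (n_days - window + 1, window, 4).
    """
    n = ohlc.shape[0] - window + 1
    return np.stack([ohlc[i:i + window] for i in range(n)])

def normalise_window(w: np.ndarray) -> np.ndarray:
    """Min-max scale a single (window, 4) tensor; relative candlestick
    geometry within the window is unchanged."""
    lo, hi = w.min(), w.max()
    return (w - lo) / (hi - lo + 1e-12)

# windows = make_windows(prices)                        # prices: (n_days, 4) OHLC array
# x = np.stack([normalise_window(w) for w in windows])  # shape: (n_samples, 20, 4)
```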

2.3 Training methodology and implementation

Ghoshal and Roberts [17] did not explicitly learn any of the standard candlestick patterns. Their approach was to allow the deep learning models to extract potential candlestick patterns, rather than explicitly learning set patterns based on established candlestick pattern theory. A significant difference between the two approaches is the predictive ability that emerges: whilst learning theoretical patterns may appear to yield superior predictive power, letting the network discover the patterns itself may be better suited where nuances contained within the training data (such as geographical location) must be inferred by the network.

In justification of their adopted approach, Ghoshal and Roberts [17] used a variety of classic statistical models in addition to machine learning algorithms. The deep learning models produced the most significant results of all their models, with Z-scores up to 36.546. That particular result was achieved using a CNN with a single convolutional layer and a filter length equivalent to one trading day. Given that a 1-day kernel considers each candlestick individually, while 2- or 3-day kernels consider 2- or 3-day candlestick patterns, this suggests it is preferable to allow the CNN to identify the relevant combinations of individual candlesticks itself, rather than explicitly instructing it to consider more than one candlestick at a time. It is well known that inputs carrying no information are simply ignored by a network, and thus, in the context of this study, the model determines during training the appropriate number of candlesticks for its representations. The best-performing deep learning architectures developed by Ghoshal and Roberts [17] were chosen for comparison with the current study. In addition, Ghoshal and Roberts’ [17] finding that deep learning methods outperform classical machine learning methods for this task further motivates this work’s emphasis on deep learning techniques.

Specifically, this work makes use of fully connected, recurrent, and convolutional layers. The multi-layer perceptron (MLP) is a classic artificial neural network and is known as a fully connected or densely connected network [25]. Each neuron in each layer is connected to every neuron in the immediately preceding and succeeding layers (where applicable). It consists of at least three layers: an input layer, any number of hidden layers and an output layer. Dense layers are found in a number of other network architectures, typically as classifiers, although they are also commonly used on their own, in which case the dense layers perform both feature extraction and classification.
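As an illustration only, a small Keras MLP of this form, operating on a flattened (20, 4) window, might be defined as follows; the layer widths are hypothetical and do not correspond to the validated architecture reported in Table 2.

```python
import tensorflow as tf

def build_mlp(window=20, features=4, hidden_units=(64, 32)):
    """Illustrative fully connected classifier on a flattened (20, 4) window."""
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(window, features)),
        *[tf.keras.layers.Dense(u, activation="relu") for u in hidden_units],
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(next-day 'UP')
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```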

Recurrent neural networks (RNNs) retain information from previous inputs using internal state, which is updated with each subsequent input [25]. As a result of this information progression, they are commonly used for natural language processing and financial time-series modelling [19, 49]. Dense layers are used in recurrent neural networks to classify the outputs from the recurrent layers. There are many types of RNN; the architectures used in this study are the long short-term memory (LSTM) [35] and the gated recurrent unit (GRU) [36]. Both were designed to address the vanishing gradient problem [25]. They share many similarities but differ primarily in the gating mechanisms used, as a result of which the GRU has fewer trainable parameters.
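An illustrative recurrent counterpart is sketched below; again, the unit count is hypothetical and not the validated architecture of Table 4. Switching the layer class from LSTM to GRU yields the lighter variant with fewer trainable parameters.

```python
import tensorflow as tf

def build_rnn(window=20, features=4, units=32, cell="lstm"):
    """Illustrative recurrent classifier on (20, 4) candlestick windows."""
    layer_cls = tf.keras.layers.LSTM if cell == "lstm" else tf.keras.layers.GRU
    model = tf.keras.Sequential([
        layer_cls(units, input_shape=(window, features)),  # recurrent feature extraction
        tf.keras.layers.Dense(1, activation="sigmoid"),     # dense classifier on top
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```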

Convolutional neural networks utilise convolutional layers to perform feature extraction. Convolutional layers are spatially invariant, meaning that features learnt in one area of the input can be applied to other areas of that input. This is a significant improvement over the MLP, which is spatially variant and consequently requires additional parameters to learn the same features. Dense layers are employed in CNNs to classify the features extracted by the convolutional layers. CNNs also learn pattern hierarchies, meaning that local features are extracted first and then combined to create more generalised global features [25]. The primary application of CNNs is to computer vision tasks; however, they can also be applied to data such as financial time-series. For additional information on these networks, as well as their mathematical formulations, the reader is referred to the work of Goodfellow et al. [50].
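A comparable 1-D convolutional sketch is shown below, with a kernel length of one trading day in the spirit of the filter-length discussion above; the filter count is hypothetical and not the validated architecture of Table 3.

```python
import tensorflow as tf

def build_cnn(window=20, features=4, filters=32, kernel_days=1):
    """Illustrative 1-D convolutional classifier; kernel_days=1 means each
    filter sees one candlestick at a time."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(filters, kernel_size=kernel_days, activation="relu",
                               input_shape=(window, features)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # dense classification head
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```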

In order to compare the results of the models with those of Ghoshal and Roberts [17], the same set of metrics is adopted: accuracy, precision, recall, F-score, area under the receiver operating characteristic curve (AUC), Z-score and P value. The standard formulation of accuracy is used, representing the proportion of predictions that are correct, calculated from the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Precision and recall are also calculated from these counts: precision is the number of true positives relative to the total number of predicted positives, while recall is the number of true positives relative to the total number of actual positives. The F-score is a combined representation of precision and recall. Following the methodology and notation of Ghoshal and Roberts [17], we calculate the AUC using the popular scikit-learn Python package and then use this result to obtain the test statistic, U, of the Mann–Whitney–Wilcoxon test as per Mason and Graham [51]. Here, the number of positive and negative samples in the holdout set are denoted \(n_P\) and \(n_N\), respectively, and the standardised U statistic gives the Z-score. We note that while the accuracy and Z-score metrics are not directly derived from one another, both reflect similar attributes of model performance, so each provides a different perspective on the overall results. We adopt the standard formulae (in keeping with Ghoshal and Roberts [17]) to define the following measures:

$$\begin{aligned}&\text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \\&\text {Precision} = \frac{TP}{TP + FP} \\&\text {Recall} = \frac{TP}{TP + FN} \\&\text {F-score} = 2\times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \\&U = AUC \times n_P \times n_N \\&Z = \frac{U-\mu _U}{\sigma _U} \\&\mu _U= \frac{n_P \times n_N}{2} \\&\sigma _U = \sqrt{\frac{n_P \times n_N \times (n_P + n_N + 1)}{12}} \end{aligned}$$
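These measures translate directly into code. The sketch below uses scikit-learn (which, as noted above, is used to compute the AUC) for the classification metrics and the formulae above for U and the Z-score; the function name and the 0.5 decision threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the metrics above from holdout labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    n_p = int(y_true.sum())            # number of positive ('UP') samples
    n_n = int(len(y_true) - n_p)       # number of negative ('DOWN') samples

    auc = roc_auc_score(y_true, y_prob)
    u = auc * n_p * n_n                                   # Mann-Whitney-Wilcoxon U
    mu_u = n_p * n_n / 2.0
    sigma_u = np.sqrt(n_p * n_n * (n_p + n_n + 1) / 12.0)
    z = (u - mu_u) / sigma_u

    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f_score": f1_score(y_true, y_pred),
        "auc": auc,
        "z_score": z,
    }
```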

Finance-specific metrics, namely profit and compound annual growth rate (CAGR), are also included for comparison purposes. As per [17], profit is expressed as a multiple of the starting balance, while CAGR is the annualised rate of return over a holding period of t years:

$$\begin{aligned} \text {CAGR} = \left( \frac{\text {Balance}_{\text {End}}}{\text {Balance}_{\text {Start}}}\right) ^{1/t} - 1 \end{aligned}$$
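As a brief worked illustration (not the study's code), the formula above reduces to a single line, where t is the holding period in years:

```python
def cagr(balance_start: float, balance_end: float, years: float) -> float:
    """Compound annual growth rate over `years` (the t in the formula above)."""
    return (balance_end / balance_start) ** (1.0 / years) - 1.0

# One reading of 'profit as a multiple of the starting balance':
#   profit_multiple = balance_end / balance_start
# Example: cagr(1.0, 1.5, 2.0) -> ~0.2247, i.e. roughly 22.5% per annum.
```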

The models and data manipulation were implemented in Python 3.7.6 using TensorFlow 2.2.0. Typical training time for each model on an NVIDIA GeForce RTX 2080 Ti averaged between five and ten minutes with early stopping enabled. Automated routines were developed specifically for this study to ensure fair and efficient coverage of the hyperparameter space for the candidate models. Training the US-validated model on the Australian data was a straightforward procedure, as no model validation was required; the training and validation data splits were used to develop the best possible alternative model for the Australian market. The comparison of the results of these models is now detailed.
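A representative training call with early stopping under the stated TensorFlow 2.2 setup might look as follows; the data arrays, patience, batch size and epoch budget are placeholders rather than the settings actually used, and build_rnn refers to the illustrative sketch above.

```python
import tensorflow as tf

# Hypothetical names: x_train, y_train, x_val, y_val are the (n, 20, 4) tensors
# and binary labels produced in Sect. 2.2; build_rnn is the earlier sketch.
model = build_rnn(cell="lstm")
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100, batch_size=128,
                    callbacks=[early_stop])
```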

3 Discussion of results

Tables 2, 3 and 4 present the structure of the MLP, CNN and RNN models trained and validated on the Australian regional data. Interestingly, like the US-validated models, the Australian-validated models share the same general structure, in terms of the number of layers and the number of neurons within those layers.

Table 2 The validated MLP architecture for the Australian market
Table 3 The validated CNN architecture for the Australian market
Table 4 The validated LSTM/GRU architecture for the Australian market

Utilising the Z-scores as a measure of the efficacy of each model [17], the continuation study did not return the same overall level of significance as the original study, but did perform better on a few of the metrics, as shown in Table 5. The results for US data were obtained from [17], while those for Australian data were produced in this study. These initial results were inconclusive as to the efficacy of the model replication, so the study was extended to include a model chosen by validating several architectures developed from scratch on the Australian data. This extension therefore investigates the effect of market-specific microstructure factors on the selection of the final model architecture.

Upon training these models, every architecture developed and validated on the Australian data generated more favourable results (for most metrics) than the US-validated architecture retrained on the same Australian data. From the perspective of deep learning model construction, these better results were achieved with significantly fewer trainable parameters, as shown in Table 6. Given that the current results could not statistically discriminate between the predictive capabilities of the two best-performing models, the LSTM and the 1-day CNN, the standard practice of choosing the architecture with fewer weights was used to nominate the best model; this is the application of Occam’s razor to deep learning models [52]. In this case, the CNN uses 15 times as many weights, so the LSTM is chosen as the superior model. In addition, the LSTM is designed for tasks with temporal elements, whereas the CNN is not, which lends further support to this selection and may also help to explain why the LSTM obtained comparable results with far fewer trainable parameters.

It is interesting to note that Ghoshal and Roberts’ [17] final models were also substantially larger than those validated in this study (see Table 6). Consequently, it is proposed that, because the Australian market is smaller and less liquid than the US market, fewer significant features can be extracted, in keeping with network capacity being proportionate to the size and liquidity of the market of interest. Utilising larger networks validated on larger markets masks the key features of a small regional market: the excess capacity of such networks obscures the factors that clearly had a larger relative influence in the Australian market. As a result, a model developed on multiple regional markets would necessarily need a significantly deeper structure to extract the individual regional market factors.

Table 5 Summary of Ghoshal and Roberts’ model results
Table 6 Summary of model results on Australian data

The practical applications of this work appear in an examination of a simple trading strategy over the holdout test period using the Australian-validated LSTM. Following the methodology adopted by Ghoshal and Roberts [17], the strategy takes positions in those stocks for which the predicted probability on the day exceeds the centile threshold. This centile threshold is determined using the training set, and each position is an equally weighted proportion of the portfolio value. The same range of transaction costs used by Ghoshal and Roberts [17] is implemented here, and the cumulative profit and CAGR results are presented in Table 7. There are some striking differences in the results shown in this table; however, given the differences between the markets (such as regulation and the number of market participants), this variation is to be expected and within reasonable bounds. In addition, Ghoshal and Roberts [17] reported breakeven at a transaction cost of 0.35%, whilst the Australian model breaks even at 0.26%.

Table 7 Comparison of cumulative profits (as a multiple of the starting balance) and compound annual growth rates (%)
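To make the strategy concrete, a heavily simplified sketch is given below. It assumes daily rebalancing into an equally weighted long portfolio of all stocks whose predicted 'UP' probability exceeds the training-set centile threshold, with a flat per-day transaction cost; slippage and the other frictions noted in Sect. 4 are ignored, and none of the names correspond to the study's actual code.

```python
import numpy as np

def backtest(prob, next_ret, threshold, cost=0.0, start_balance=1.0):
    """Each day, go long (equal weights) in every stock whose predicted
    'UP' probability exceeds the centile threshold from the training set.

    prob, next_ret: arrays of shape (n_days, n_stocks) holding predicted
    probabilities and realised next-day close-to-close returns.
    cost: flat transaction cost charged against each day's portfolio return.
    """
    balance = start_balance
    for day_prob, day_ret in zip(prob, next_ret):
        picks = day_prob > threshold
        if picks.any():
            gross = day_ret[picks].mean()     # equally weighted long portfolio
            balance *= 1.0 + gross - cost     # apply the daily transaction cost
    return balance                            # cumulative multiple of start balance
```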

The weight-space visualisations shown in Figs. 1 and 2 are known as Hinton diagrams. White and black squares indicate positive and negative values, respectively, while the size of each square indicates the magnitude of the value. Ghoshal and Roberts [17] experimented with a variety of filter sizes (1, 2 and 3 days) and generated the associated Hinton diagrams. Figure 1 presents the Hinton diagrams from the 1-day CNN of the current study, which are notably different from the diagrams in Ghoshal and Roberts [17] (see Fig. 2), demonstrating the difference in the features extracted for each market. While an exact interpretation is known to contain a subjective element, there can be no doubt about the markedly different patterns produced, which is further evidence that the models have extracted inherently different features.
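Hinton diagrams of this kind can be produced with a short matplotlib routine similar to the sketch below (adapted from the common matplotlib recipe); the final commented line, showing how the convolutional weights might be pulled from a fitted Keras model, is illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np

def hinton(weights, ax=None):
    """Draw a Hinton diagram: white squares for positive weights, black for
    negative, with the square area proportional to the weight magnitude."""
    ax = ax or plt.gca()
    ax.patch.set_facecolor("gray")
    ax.set_aspect("equal")
    max_w = max(np.abs(weights).max(), 1e-12)
    for (y, x), w in np.ndenumerate(weights):
        colour = "white" if w > 0 else "black"
        size = np.sqrt(abs(w) / max_w)       # side length; area tracks |w|
        ax.add_patch(plt.Rectangle((x - size / 2, y - size / 2), size, size,
                                   facecolor=colour, edgecolor=colour))
    ax.autoscale_view()
    ax.invert_yaxis()

# e.g. for the first convolutional layer of a fitted CNN (illustrative):
# hinton(model.layers[0].get_weights()[0].squeeze()); plt.show()
```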

Fig. 1
figure 1

Weight-space visualisation of convolutional layer (Hinton Diagrams) for the current study

Fig. 2
figure 2

Hinton diagrams for Ghoshal and Roberts’ study

4 Conclusion and future research

This paper has extended an important recent US study to Australian data. Upon retraining the original US-validated architectures on Australian data, the results fell short of their original performance, suggesting that the US models could not exploit the regional specifics of the Australian market. In comparison, the newly validated Australian models significantly outperformed these original architectures. These results are attributable to the difference in microstructure factors across markets, which affect the selection of the final network architecture. Specifically, we find that the Australian-validated LSTM and CNN obtain the most significant results, with the LSTM achieving slightly superior results using 15 times fewer parameters. Given the outperformance of the Australian-validated models over those validated on US data, we propose that region-specific models are required; that is, the model architectures need to be optimised for each market of interest. As such, a machine learning expert is still required to develop the network architecture, as models developed on other markets cannot be effectively applied to a new market. This has implications for both practitioners and researchers in computational finance.

In addition, this study forms part of an effort to overcome the lack of generalisability of results identified in a published survey and analysis of the relevant financial modelling literature. This is a notable contribution, since many existing studies do not consider the effect that market-specific factors have upon model performance, that is, the transferability of results between markets. A simple trading strategy was developed that produced above-market returns on the holdout test data. It is suggested that further work be completed to investigate the effect of slippage and other real-world considerations, such as the validity of assessing returns close-to-close.

Future research should be conducted on additional regional markets to confirm the findings of this work. Additionally, an open research question is whether a model can be developed and trained across several regional markets with comparable accuracy to the models already developed. Such a model would undoubtedly require a very deep neural network to enable it to infer regional factors. Necessarily this would require additional independent variables, over and above the candlestick data, to be included in the training data for any proposed global model.