Introduction

This paper studies two complementary topics in the empirical asset pricing literature: stock return prediction and machine learning. The traditional forecasting literature applies statistical models to estimate future stock returns along two strands: time series (TS) predictions and cross-sectional (CS) predictions. TS models typically conduct time series regressions of broad aggregate portfolio returns on a number of macroeconomic variables, including valuation ratios and interest rates, and on technical indicators (Cochrane 2011; Rapach and Zhou 2013). CS models, in contrast, usually explain differences in expected returns across stocks as a function of stock-level characteristics, such as size, value, and momentum (Fama and French 1993; Jegadeesh and Titman 1993; Lewellen 2015). This approach typically runs Fama and MacBeth (1973) regressions (FM regressions).

In most applications, in line with the traditional asset pricing literature, TS and CS models consider only linear relationships between financial variables and subsequent stock returns. For example, the Capital Asset Pricing Model (CAPM) introduced by Sharpe (1964), Lintner (1965), and Mossin (1966) posits that, in equilibrium, a stock’s expected return is solely driven by its sensitivity to a systematic risk factor, i.e., the market risk. An assumption is that the underlying pricing kernel is linear in only a single factor, i.e., the market portfolio.Footnote 1 Various studies, however, report violations of this assumption (see, e.g., Hou et al. 2020, for a comprehensive list of asset pricing anomalies) and study alternative asset pricing models. Following Dittmar (2002), we classify them into two subcategories. The first subcategory utilizes other pricing factors in addition to the market portfolio. Most prominently, Fama and French (1993) propose a multifactor alternative to the CAPM and find that it is better at explaining cross-sectional variation in expected returns than the CAPM. Other examples include Ross’ (1976) asset pricing theory (APT) or Merton’s (1973) intertemporal CAPM (ICAPM). The second subcategory abandons the restriction that the pricing kernel must be linear in pricing factors. Bansal et al. (1993), Bansal and Viswanathan (1993), Chapman (1997), Dittmar (2002), and Asgharian and Karlsson (2008), among others, explore various nonlinear pricing kernels and show that such specifications outperform linear counterparts.

While the first subcategory of models motivates the use of multiple pricing factors, the second subcategory suggests that using interactions between these factors and incorporating nonlinear relationships between price-related variables and expected stock returns add incremental explanatory power. A problem arises particularly in a prediction context, because the convexity of the traditional least squares objective tends to overemphasize heavy-tailed observations. Therefore, as the number of predictors increases, simple linear models begin to overfit noise rather than extract signal, which undermines the stability of their predictions. Since financial data are inherently noisy, predictors are potentially multicollinear, and the relationships between predictor variables and expected returns are likely interactive and nonlinear as well as time-varying and/or contextual, it seems virtually impossible for simple linear models to generate reliable forecasts of future stock returns.

Machine learning can help overcome the high dimensionality problem by combining many weak sources of information into a strong composite forecast. According to Gu et al. (2020; p. 2225), “[t]he definition of machine learning is inchoate and is often context specific. [It] […] describe[s] (i) a diverse collection of high-dimensional models for statistical prediction, combined with (ii) so-called ‘regularization’ methods for model selection and mitigation of overfit, and (iii) efficient algorithms for searching among a vast number of potential model specifications.”

In our empirical analysis, we exploit the set of twenty-two predictors from the linear FM regressions approach used in Drobetz et al. (2019). This is our benchmark model, and it explains a substantial percentage of the cross-sectional variation in European stock returns. Against this established conservative benchmark, we compare the performance of different machine learning methods in forecasting stock returns, from both a statistical and an economic perspective. Our objective is to examine whether incorporating interactions between predictor variables and nonlinear effects leads to incremental predictive power.

Our empirical analysis follows that of Gu et al. (2020), with several key differences: First, we use European data. Second, we use an established linear benchmark model with a parsimonious set of twenty-two predictors, rather than a comprehensive set of predictor variables from the voluminous literature. Third, in addition to inspecting changes in a forecast model’s degree of complexity over time, our tests show the importance of each predictor in a given model over the full sample period, and we also scrutinize the time variation of each predictor variable’s importance. This approach can reveal whether some predictors (1) are uninformative during the entire sample period, (2) lead to a permanent deterioration in a forecast’s signal-to-noise ratio, and (3) can thus be removed from the set of baseline variables. Fourth, we apply a conservative transaction cost model to investigate whether the value-added of active trading strategies remains promising under realistic trading assumptions.

Finally, we challenge the traditional formation process of expected return-sorted portfolios, which uses some statistical prediction model to estimate expected returns first, and selects stocks with the highest expected return forecasts into decile portfolios second. The underlying assumption is that stocks with high expected returns ex ante will deliver high realized returns ex post. Rather than taking the “detour” to estimate stock-level expected returns, we take an alternative approach here. In particular, we use a support vector machine to directly classify stocks into decile portfolios based on linear combinations of predictor variables.

In our empirical analysis, we enhance the set of twenty-two predictors from Drobetz et al. (2019) by two-way interactions as well as second- and third-order polynomials to capture nonlinearity. Overall, we find that interactions and nonlinear effects help improve the predictive performance. However, machine learning models must be adequately trained and tuned to avoid overfitting.

Overfitting can manifest in two forms: model overfitting and backtest overfitting. Model overfitting refers to machine learning models with overly high in-sample fit but poor out-of-sample predictive performance. To avoid model overfitting, we control for the degree of model complexity by tuning relevant hyperparameters. These parameters cannot be preset, but must be determined adaptively from the sample data. Backtest overfitting refers to a researcher’s arbitrariness in choosing firm coverage, sample period, predictors, and tuning parameters. If information from the out-of-sample period is used to fit the models, consciously or not (Schorfheide and Wolpin 2012), this might lead to overstated out-of-sample predictive performance (Bailey et al. 2014, 2017; Harvey and Liu 2014, 2015; Harvey et al. 2016). To avoid backtest overfitting, we include all firms that were or are publicly listed in one of the nineteen Eurozone countries as of December 2020. According to Gu et al. (2020), using large datasets mitigates sample selection or data snooping biases (Lo and MacKinlay 1990) and also can help avoid model overfitting by increasing the ratio of observation count to parameter count. In line with Drobetz et al. (2019), we use a set of twenty-two firm-level predictors that are commonly used in the asset pricing literature (Chordia et al. 2017; Harvey et al. 2016; Lewellen 2015; McLean and Pontiff 2016). We refrain from focusing on only a small subset of covariates that have been shown to perform best in similar prediction tasks. Finally, we opt for the time series cross-validation approach to fit the machine learning models. It maintains the temporal ordering of the data and splits the sample into three distinct subsamples: a training sample, a validation sample, and a test sample (which is used for neither model estimation nor parameter tuning, and thus is truly out-of-sample).Footnote 2

We observe that each forecast model’s degree of complexity varies substantially over time, but they all identify similar predictors as important. The most influential predictors are based on recent price trends, e.g., short-term reversal and stock momentum, and on fundamental signals from valuation ratios, e.g., earnings-to-price and book-to-market ratios. Despite these commonalities, our forecast models exhibit differences in statistical predictive performance (as indicated by predictive slopes, predictive \(R^{2}\) metrics, and Diebold and Mariano (1995) test statistics), which also translate into marked differences in economic profitability. The return and risk measures of long-only investment strategies, i.e., sorting stocks into decile portfolios based on expected return estimates and buying the top decile portfolio, indicate that machine learning methods can produce predictive gains. These gains are attributable to predictor interactions and to nonlinear effects that are overlooked by the linear benchmark model. We find that neural networks perform the best overall, even after accounting for transaction costs.

Finally, we compare the performance of traditional expected return-based portfolio formation with a classification-based approach. In particular, we contrast neural networks (which outperform all other traditional approaches in our empirical analysis) with support vector machines (SVMs). The idea behind SVMs is to search for hyperplanes that partition a multidimensional vector space (our sample of firm observations, consisting of stock-level predictors and decile portfolio labels) into groups of vectors that belong to the same class. This allows for the classification of stocks into decile portfolios without taking the “detour” of predicting stock-level expected returns. We find that the classification-based approach is superior to even the best-performing expected return-based portfolio formation because it avoids some of the noise in stock-level returns. Most importantly, it is able to retain the broad return signal that is essential for our trading strategy, i.e., the correct classification into a decile portfolio.

The remainder of this paper is organized as follows: The second section briefly reviews the literature on stock return prediction and machine learning. The third section describes our dataset. The fourth section summarizes the different machine learning methods and tuning parameters, while the fifth section compares their predictive performance. We contrast the traditional formation process of expected return-sorted portfolios with an alternative approach using classification-based portfolios in the sixth section. The seventh section concludes.

Literature review

Several prior studies are related to our paper. Haugen and Baker (1996), Lewellen (2015), and Drobetz et al. (2019) examine composite estimates of expected excess returns obtained from cross-sectional FM regressions. They present comprehensive evidence that the simultaneous incorporation of multiple firm characteristics as predictors provides strong out-of-sample predictive power for subsequent stock returns. They also challenge the efficient market hypothesis by concluding that stocks with higher expected (and realized) returns are not riskier than their low-return counterparts. In turn, they show that stock return predictions can be used to implement a profitable investment strategy.

Research on machine learning methods in the asset pricing literature is still scant, and different forecast approaches are usually examined separately. Most studies compare different specifications from the same model family, e.g., approaches that conduct variable selection/shrinkage or dimension reduction, tree-based models, and neural networks. For example, Kozak et al. (2020) use the joint explanatory power of a large set of cross-sectional stock return predictors to form a robust stochastic discount factor, contrasting the lasso and ridge penalizations.

Freyberger et al. (2020) and Rapach et al. (2013) also use the lasso framework. The former applies a nonparametric adaptive group lasso version to explore which characteristics add incremental predictive power for the cross-section of expected returns. The latter considers country-level stock returns to examine lead–lag relationships. Giglio and Xiu (2021) take a dimension reduction approach. They apply a principal component analysis (PCA) method to estimate the risk premiums of factors that are potentially omitted in asset pricing models due to missing or limited observability. Kelly et al. (2019) introduce an instrumentalized version of the PCA approach to re-estimate common equity factors. Turning to tree-based architectures, Coqueret and Guida (2018) use single regression trees to estimate the association between firm characteristics and subsequent stock returns. Both Leung et al. (2021) and Moritz and Zimmermann (2016) apply ensemble versions that avoid overfitting to study how common equity factors and lagged stock returns predict future stock returns.

Several studies apply neural networks in various specifications, most commonly feedforward multilayer architectures. For example, Levin (1995) and Messmer (2017) examine the predictive power of firm characteristics on subsequent stock returns, while Heaton et al. (2016) assess the applicability of deep learning approaches for multiple regression and classification problems in finance. Chen et al. (2019) and Gu et al. (2021) consider other architectures, such as recurrent long short-term memory networks, generative adversarial networks, and autoencoder networks. Rasekhschaffe and Jones (2019), Wolff and Neugebauer (2019), and Gu et al. (2020) provide a comparison of different forecast models.

Finally, Leung et al. (2000) focus on the sign rather than the level of stock returns. They apply classification-based methods (including linear discriminant analysis, logit, and probit) to predict the direction of stock market movements. Similarly, Huerta et al. (2013) train an SVM to classify stocks into future out- and underperformers.

Data

The market and fundamental data for European firms come from Thomson Reuters Datastream. Our sample is free of survivorship bias and includes all firms that were or are publicly listed in one of the nineteen Eurozone countries as of December 2020.Footnote 3 We collect all data on a monthly basis; currency-related data are denominated in Euro. We only include firm observations that provide full information on excess returns and all twenty-two firm characteristics used in the empirical analysis (no missing values). In every month, we require at least fifty firms with a minimum market capitalization of €25 million that meet this full information criterion. This restricts our sample period to January 1990–December 2020. To calculate excess returns, we use the three-month EURIBOR rate, scaled to the one-month horizon, as the risk-free rate.

We use twenty-two firm-level predictors that are commonly used in the asset pricing literature (Chordia et al. 2017; Harvey et al. 2016; Lewellen 2015; McLean and Pontiff 2016). These variables have been shown to explain most of the cross-sectional variation in expected returns. We also consider two-way interactions between firm characteristics as predictors as well as second- and third-order polynomials of firm characteristics to cover nonlinearity. We assume that market data become public immediately, while fundamental data are known four months after the fiscal year-end. Table 1 gives the definitions of the twenty-two predictors that serve as our starting point for the creation of an extended set of covariates.

Table 1 Descriptive statistics

We follow Drobetz et al. (2019) in constructing our twenty-two variables; they note two important caveats: First, most characteristics represent level variables that change slowly (like size) or flow variables that are measured over at least one year (like book equity). This translates into high persistence over time, which in turn suggests that any predictability in monthly excess returns is likely to extend to longer horizons (Campbell and Cochrane 1999; Cochrane 2008). Second, many of the characteristics are constructed similarly (e.g., the issuance metrics) or incorporate similar firm information (e.g., the profitability measures), which leads to high correlations. According to Lewellen (2015), the resulting multicollinearity is not of concern in our empirical analysis. This is because we are mostly interested in the overall predictive power of the respective machine learning models, rather than the marginal effects of each single predictor. Machine learning models are well suited to address the multicollinearity problem, for example by selecting or shrinking variables, and could thus add incremental predictive power.

We adjust our initial sample in two steps: First, we winsorize all monthly firm characteristics at the 1% and 99% levels to correct for outliers. In contrast to Gu et al. (2020), we also winsorize excess returns.Footnote 4 Second, we rank all firm characteristics month-by-month cross-sectionally and then map the ranks into the \(\left( { - 1, + 1} \right)\) interval, following Kelly et al. (2019) and Freyberger et al. (2020).
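To illustrate these two adjustment steps, the following minimal pandas sketch winsorizes and rank-transforms the data month by month. It assumes a long panel with a `month` column and an `excess_return` column and maps ranks into \(( - 1, + 1)\) via \(2 \cdot \text{rank}/(n+1) - 1\); the column names and the exact rank-mapping convention are illustrative assumptions, not the authors' code.

```python
import pandas as pd

def winsorize_and_rank(df, char_cols, ret_col="excess_return"):
    """Month-by-month: winsorize at the 1%/99% levels, then map the
    cross-sectional ranks of each characteristic into the (-1, +1) interval."""
    def per_month(g):
        for col in char_cols + [ret_col]:            # winsorize characteristics and excess returns
            lo, hi = g[col].quantile([0.01, 0.99])
            g[col] = g[col].clip(lower=lo, upper=hi)
        n = len(g)
        for col in char_cols:                        # rank-transform the characteristics only
            ranks = g[col].rank(method="average")    # 1, ..., n
            g[col] = 2.0 * ranks / (n + 1) - 1.0     # strictly inside (-1, +1)
        return g

    return df.copy().groupby("month", group_keys=False).apply(per_month)
```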

Table 1 also reports the time series averages of monthly cross-sectional means and standard deviations for excess returns and the twenty-two characteristics as well as the overall sample size. The average universe of stocks covers 832 firms with non-missing observations, ranging from 56 in January 1990 to 1,012 in December 2020. The average monthly excess return is 0.51%, with a standard deviation of 4.55%. The first and second moments of all predictor variables are similar to those found in earlier studies (Lewellen 2015; Drobetz et al. 2019).

Machine learning methods

For our empirical analysis, our benchmark is the modified linear FM regressions approach used in Drobetz et al. (2019). Our main objective is to examine whether incorporating interactions and nonlinearity adds incremental predictive power. Therefore, we compare the predictive performance of different machine learning methods from both a statistical and an economic perspective. As in Gu et al. (2020), we apply an additive prediction error model to describe a stock’s excess return, i.e., \(r_{i,t + 1} = E_{t} \left( {r_{i,t + 1} } \right) + e_{i,t + 1}\), where \(r_{i,t + 1}\) is the excess return of stock \(i = 1, \ldots ,N_{t}\) in month \(t = 1, \ldots ,T\). The expected excess return is estimated as a function of predictor variables and described by the “true” model \(g^{*} \left( {z_{i,t} } \right)\), where \(z_{i,t}\) represents the P-dimensional set of predictors, i.e., \(E_{t} \left( {r_{i,t + 1} } \right) = g^{*} \left( {z_{i,t} } \right)\). Although our forecast models belong to different families, they are all designed to approximate the true forecast model by minimizing the out-of-sample mean squared forecast error (MSFE), i.e., \({\text{MSFE}}_{t + 1} = \frac{1}{{N_{t + 1} }}\sum\nolimits_{i = 1}^{{N_{t + 1} }} {(\hat{e}_{i,t + 1} )^{2} }\), where \(\hat{e}_{i,t + 1}\) is the individual forecast error for stock \(i\) from a particular forecast model, and \(N_{t + 1}\) is the number of stocks in month \(t + 1\). Approximations of the conditional expectation \(g^{*} \left( {z_{i,t} } \right)\) are flexible and family-specific. Approximation functions \(g\left( . \right)\) can be linear or nonlinear, as well as parametric, \(g\left( {z_{i,t} ,\theta } \right),\) where \(\theta\) is the set of true parameters, or nonparametric, \(g\left( {z_{i,t} } \right)\).Footnote 5

Before presenting the results of our empirical analysis, we now briefly introduce the sample-splitting scheme used to fit the forecast models. We then discuss each machine learning method and its key tuning parameters (see Appendix 1, Table 9 for details).Footnote 6

Sample splitting

Machine learning methods can help overcome the high dimensionality problem, which arises when the number of predictors becomes very large relative to the number of observations. They can also help solve the problem of multicollinearity, which can occur when predictor variables are highly correlated. However, they are prone to overfitting, so we must control for the degree of model complexity by tuning the relevant hyperparameters. Examples of tuning parameters are the number and/or depth of trees in boosted regression trees or random forests. To avoid overfitting and maximize out-of-sample predictive power, hyperparameters cannot be preset, but must be determined adaptively from the sample data. In particular, they are selected from a comprehensive set of parameter specifications (see Appendix 1, Table 9 for details). The parameter tuning approach iteratively reduces in-sample fit by searching for the degree of model complexity that produces reliable out-of-sample predictive performance. To this end, it uses the time series cross-validation approach, which maintains the temporal ordering of the data and splits the sample into three distinct subsamples: a training sample, a validation sample, and a test sample.

The training sample is used to estimate the model for multiple parameter specifications, while the validation sample is used to tune the parameters. Based on the models estimated from the training sample, the MSFE within the validation sample is calculated for each parameter specification.Footnote 7 The model with the parameter specification that minimizes the MSFE is used for out-of-sample testing. Note that, because the tuning parameters are chosen from the validation sample, it is not truly out-of-sample. The test sample, however, is used for neither model estimation nor parameter tuning. This is why it is truly out-of-sample and appropriate for evaluating a model’s out-of-sample predictive power.

In an asset management context, where new data emerge over time, a sample-splitting scheme that periodically includes more recent data must be applied (see, e.g., West 2006, for an overview). This is why the “rolling window” and “recursive window” methods gradually shift the training and validation samples forward in time. The former method holds the length of training and validation samples constant; the latter increases them progressively. Moreover, because the recursive window approach always incorporates the entire history of data, it is computationally more intensive than the rolling window approach.

Because of this, and because machine learning algorithms are generally computationally intensive, Gu et al. (2020) use a hybrid of the rolling and recursive approaches. They avoid recursively refitting the models each month. Instead, they refit once every year, as most of the fundamental firm characteristics are updated annually. Each time, they increase the training sample by one year, while holding the length of the validation sample constant and rolling it forward by one year. To maintain comparability with Drobetz et al. (2019), we choose the rolling window approach in our empirical analysis. In contrast to annually increasing the training sample, we hold its length constant as well and roll it forward from year to year. We always select eight years for training, two years for validation, and one year for testing from the data.Footnote 8
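A minimal sketch of this rolling scheme is given below; the alignment of the first split with the December 1999 estimation date is an assumption for illustration rather than the authors' exact implementation.

```python
def rolling_splits(years, train_len=8, val_len=2, test_len=1):
    """Yield (training_years, validation_years, test_years) for the rolling
    window scheme: 8 years training, 2 years validation, 1 year testing,
    all rolled forward by one year at each annual re-estimation date."""
    window = train_len + val_len + test_len
    for start in range(len(years) - window + 1):
        train = years[start : start + train_len]
        val = years[start + train_len : start + train_len + val_len]
        test = years[start + train_len + val_len : start + window]
        yield train, val, test

# Example: first estimates in December 1999 imply training on 1990-1997,
# validation on 1998-1999, and testing on 2000 (assumed alignment).
for tr, va, te in rolling_splits(list(range(1990, 2021))):
    pass  # fit on tr, tune hyperparameters on va, predict on te
```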

Forecast models

We use two nested sets of predictors. The parsimonious set covers the twenty-two baseline predictors used by Drobetz et al. (2019), enhanced by forty-four dummies that correspond to the ISO country codes and the first two digits of Standard Industrial Classification (SIC) codes. This set includes \( 22 + 44 = 66\) covariates. The full set adds two-way interactions between the twenty-two baseline predictors as well as second- and third-order polynomials. In total, it includes \(22 + \frac{22 \times 21}{2} + 22 + 22 + 44 = 341\) covariates.
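As an illustration of how these covariate counts arise, the following sketch expands the baseline characteristics into the full set; the column naming scheme is hypothetical.

```python
import pandas as pd
from itertools import combinations

def build_full_set(df, base_cols, dummy_cols):
    """Expand the 22 baseline characteristics into the full set:
    22 baseline + 22*21/2 = 231 two-way interactions + 22 squares
    + 22 cubes + 44 country/industry dummies = 341 covariates."""
    X = df[base_cols + dummy_cols].copy()
    for a, b in combinations(base_cols, 2):      # two-way interactions
        X[f"{a}_x_{b}"] = df[a] * df[b]
    for c in base_cols:                          # second- and third-order polynomials
        X[f"{c}_sq"] = df[c] ** 2
        X[f"{c}_cu"] = df[c] ** 3
    return X
```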

While all forecast models aim to minimize the MSFE, they differ in their overall approach and complexity. Therefore, we outline the major differences among the model families. We begin with simple linear regression models, continue with models that perform variable selection/shrinkage or dimension reduction, and finish with other sophisticated machine learning methods. Furthermore, for each method, we visually examine how the expected return estimates change as we vary the values of a single predictor or a pair of predictors simultaneously within the (\(- 1, + 1\)) interval. We hold all other predictors fixed at their uninformative median value of zero. The visualizations (unreported) confirm that the models that are designed to incorporate interactions and nonlinearity indeed capture these effects.

Ordinary least squares

Ordinary least squares (OLS) regression models are the least complex approach in our empirical analysis. Their aim is to minimize the standard “\(l_{2}\)” objective function, i.e., \(l\left( \theta \right) = \frac{1}{NT}\sum\nolimits_{i = 1}^{N} {\sum\nolimits_{t = 1}^{T} {\left( {r_{i,t + 1} - g\left( {z_{i,t} ;\theta } \right)} \right)^{2} } }\). We further distinguish between two similar but different model specifications. The first is denoted as “ols_pars” and regresses monthly excess returns on the parsimonious set of sixty-six predictors. Using a similar or identical set of covariates, Lewellen (2015) and Drobetz et al. (2019) document that this approach is promising, both from a statistical and an economic perspective. Although they use the FM regressions model that is re-estimated on a monthly basis, the ols_pars model provides nearly identical predictions in our sample-splitting and re-estimation setting.

In particular, we replicate the linear FM regression approach implemented in Drobetz et al. (2019) and compare the expected excess return series. We find a Pearson correlation coefficient close to 1, which can be explained by the rolling window sample splitting with annual re-estimation. In this case, running cross-sectional FM regressions on a monthly basis and averaging coefficients recursively (as in Lewellen 2015, or Drobetz et al. 2019) are very similar to running a pooled OLS regression based on a rolling training sample (as in our empirical analysis).

To ensure comparability with the machine learning models that cannot be re-estimated on a monthly basis (due to computational limitations), we use the ols_pars model as a proxy for the FM regressions approach and our main benchmark. It performs well and thus serves as a conservative threshold for indicating incremental predictive performance. However, the ols_pars model cannot capture interactions or nonlinearity that may add predictive power.

This is why the second model specification is denoted as “ols_full” and regresses monthly excess returns on the full set of 341 predictors. Because we now face a high-dimensionality problem, we expect the ols_full model to perform worse than the ols_pars model. It incorporates predictors that may be highly correlated and possess a very low signal-to-noise ratio. When the number of predictors approaches the number of observations, the unrestricted OLS approach begins overfitting noise rather than extracting signal. This problem becomes worse in a low-signal-to-noise environment. In this case, the ols_full model is expected to become inefficient or even inconsistent.

We use this model specification as our subordinate benchmark in order to emphasize how well sophisticated forecast models can identify and incorporate relevant predictors to improve the predictive performance and to determine which firm characteristics, interactions, and nonlinear effects matter. To overcome the overfitting problem of highly parameterized OLS models and reduce the number of predictors, machine learning algorithms impose variable selection/shrinkage, dimension reduction, and other adjustments.

As already explained, we restrict our sample to firms with a minimum market capitalization of €25 million. Nevertheless, our results still tend to be dominated by the large number of smaller firms. This is in line with Fama and French (2008), who find that the smallest 20% of stocks comprise only three percent of aggregate market capitalization. These smallest firms are often economically inconsequential to large institutional investors, i.e., such market participants cannot invest enough money as passive shareholders, or they face high transaction costs for large trades. To strike a balance between the large number of small firms and the small number of very large firms, Grinold and Kahn (2000) propose a generalized least squares (GLS) regression setting. In particular, they suggest weighting each observation by the inverse of its estimated error variance, which they proxy by the square root of its market capitalization. Gu et al. (2020) also suggest this adjustment to improve prediction efficiency. Therefore, in a robustness test (unreported), we apply a weighted least squares loss function for both the ols_pars and ols_full model specifications to achieve more robust predictions. We use weights that are proportional to the square root of stock \(i\)’s market capitalization at time \(t\). However, because we do not observe improvements in predictive performance relative to the two unweighted OLS models, we use the simpler ols_pars and ols_full models as our benchmarks, noting that they are the more conservative representatives of this forecast model family.
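A minimal sketch of this weighting scheme is shown below, using synthetic placeholder data; the sample sizes and predictor counts are illustrative, not the paper's actual panel.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 66))        # parsimonious predictor set (placeholder)
y_train = rng.normal(scale=0.05, size=1000)  # monthly excess returns (placeholder)
mktcap = rng.lognormal(mean=5.0, size=1000)  # market capitalization (placeholder)

# Weight each observation proportionally to the square root of its market
# capitalization (Grinold and Kahn 2000), as in the unreported robustness test.
wls = LinearRegression().fit(X_train, y_train, sample_weight=np.sqrt(mktcap))
```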

Penalized least squares

The most common machine learning tool to achieve variable selection/shrinkage in a high-dimensional regression setting is the penalized least squares approach. In particular, it identifies which predictors are informative and omits those that are not. This approach modifies the least squares loss function by adding a penalty term, denoted as \({\Phi }\left( \theta \right)\), to prefer more parsimonious model specifications, i.e., \(l\left( \theta \right) = \frac{1}{NT}\sum\nolimits_{i = 1}^{N} {\sum\nolimits_{t = 1}^{T} {\left( {r_{i,t + 1} - g\left( {z_{i,t} ;\theta } \right)} \right)^{2} } } + \Phi \left( \theta \right)\).

There are two common types of penalties: First, the lasso approach penalizes the sum of absolute coefficients, thereby setting regression coefficients of a subset of predictors to exactly zero (variable selection). Second, the ridge approach penalizes the sum of squared regression coefficients, thereby only pushing coefficients close to zero (variable shrinkage). We follow the elastic net (elanet) approach, which combines the lasso and ridge methodologies.Footnote 9 It computes the weighted sum of both penalties to increase flexibility, i.e., \(\Phi \left( \theta \right) = \lambda \left( {1 - p} \right)\sum\nolimits_{j = 1}^{P} {\left| {\theta_{j} } \right|} + \lambda p\sum\nolimits_{j = 1}^{P} {\left( {\theta_{j} } \right)^{2} }\). Tuning parameters in this forecast model are \(\lambda \in \left( {0,1} \right)\) and \(p \in \left( {0,1} \right)\). \(\lambda\) indicates the general strength of the penalty (in particular, how strongly the regression coefficients are forced to zero); \(p\) denotes the relative weights on the lasso and the ridge approach. \(p = 0\) corresponds to lasso; and \(p = 1\) corresponds to ridge.
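The sketch below illustrates one way to tune the elanet penalty on the validation sample with scikit-learn. Note that scikit-learn's ElasticNet parameterizes the mix via l1_ratio, the weight on the lasso term, which corresponds roughly to \(1 - p\) in the notation above (up to the library's internal scaling); the grid values and synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(960, 341)), rng.normal(size=(240, 341))
y_train, y_val = rng.normal(scale=0.05, size=960), rng.normal(scale=0.05, size=240)

best_mse, elanet = np.inf, None
for alpha in np.logspace(-4, -1, 10):            # overall penalty strength (lambda)
    for l1_ratio in np.linspace(0.1, 0.9, 9):    # lasso weight, roughly 1 - p
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X_train, y_train)
        mse = np.mean((y_val - model.predict(X_val)) ** 2)   # validation MSFE
        if mse < best_mse:
            best_mse, elanet = mse, model
```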

Principal component regressions and partial least squares

Although the elanet approach is capable of reducing a highly parameterized model’s complexity by forcing coefficients close to or exactly to zero, it might perform badly if predictors are highly correlated. In this case, where predictors face low signal-to-noise ratios, a superior approach is to create new, de-correlated predictors as linear combinations of highly correlated variables (e.g., by averaging their values) instead of just omitting some. This reduces noise and thus increases the signal-to-noise ratio.

The basic concept of dimension reduction is variable averaging, as opposed to variable selection/shrinkage. Two common methods are principal component regression (pcr) and partial least squares (pls). Both follow a two-step procedure: First, the models conduct dimension reduction by creating new, de-correlated predictors (so-called components). Second, they identify the optimal number of components and then use them in a standard predictive regression model to estimate expected excess returns. The pcr approach forms principal components by incorporating the covariance matrix of the predictors. However, during the dimension reduction step, it does not take into account how the predictors relate to subsequent excess returns.

To overcome this deficiency, the pls approach keeps the forecast objective in mind, even during the dimension reduction step. To form the first component, it runs a univariate OLS regression of realized excess returns on each predictor separately. We can consider the resulting coefficients as reflecting “partial” sensitivity of the realized excess returns to each predictor. It then computes the weighted average of all predictors, using weights proportional to the absolute value of each coefficient. Higher weights are assigned to stronger univariate predictors (with higher absolute coefficient value), and vice versa.

To form further components, during each repetition, the realized excess return and the predictors are orthogonalized with respect to previously formed component(s). The tuning parameter in both forecast models is the number of components numbercomp included in the predictive regression. Using only a certain number of components can be considered equivalent to setting the coefficients for the other components (with low signal-to-noise ratios) to exactly zero.
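A minimal scikit-learn sketch of both approaches, tuning the number of components on the validation sample, is given below; the data and the maximum number of components are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(960, 341)), rng.normal(size=(240, 341))
y_train, y_val = rng.normal(scale=0.05, size=960), rng.normal(scale=0.05, size=240)

def tune_numbercomp(make_model, max_comp=30):
    """Select the number of components that minimizes the validation MSFE."""
    best_mse, best_k = np.inf, None
    for k in range(1, max_comp + 1):
        model = make_model(k).fit(X_train, y_train)
        mse = np.mean((y_val - model.predict(X_val).ravel()) ** 2)
        if mse < best_mse:
            best_mse, best_k = mse, k
    return best_k

# pcr: principal components ignore the forecast target during dimension reduction.
k_pcr = tune_numbercomp(lambda k: make_pipeline(PCA(n_components=k), LinearRegression()))
# pls: components are formed with the forecast objective in mind.
k_pls = tune_numbercomp(lambda k: PLSRegression(n_components=k))
```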

Random forests and gradient boosted regression trees

Regression trees incorporate multi-way interactions and nonlinearity inherently, without having to add these effects as new predictors. The idea behind regression trees is that they adaptively split the dataset into groups of observations that behave in a similar manner. They follow an iteration process that is inspired by the growing behavior of real trees in nature: First, the process begins with one initial node, the root, in order to find the optimal split variable and the optimal split value by minimizing the MSFE within each partition. This results in two nodes with minimized impurity. Second, to further disentangle the dataset, it determines optimal split variables and values on the subsamples left over from the preceding step(s) to iteratively grow the regression tree. This results in multiple final nodes with minimized impurity, the leaves. The predicted excess return for each leaf reflects the simple average of the realized excess returns of the firms sorted into this leaf. Regression trees are invariant to monotonic transformations of predictors, able to incorporate categorical and numerical data in the same forecast models, and designed to capture interactions and nonlinearity. However, they are prone to overfitting and must be strongly regularized. To accomplish this, the ensemble forecast approach aggregates forecasts from many different regression trees into a single one. According to Gu et al. (2020), there are two common methods: bagging and boosting.

Random forests (rf) modify Breiman’s (2001) traditional bagging approach. The idea is to draw multiple bootstrap samples of the original dataset, fit deep trees independently, and then average their predictions into an ensemble forecast, creating a single strong learner. Because dominant predictors are always more likely to become split variables at low levels, which can lead to large correlations between bootstrap-replicated trees, random forests apply the so-called dropout method. At each potential branch, they randomly drop out predictors, leaving only a subsample of predictors to be selected as potential split variables. The tuning parameters in this forecast model are the depth of trees \(L\), the number of predictors \(M\) randomly considered as potential split variables, and the number of trees \(B\) added to the ensemble prediction.

In contrast, gradient boosted regression trees (gbrt) follow the boosting approach, which is based on the idea that combining multiple shallow trees creates a single strong learner, stronger even than a single deep tree. The iterative procedure is as follows: The gbrt approach computes a first shallow tree to fit the realized excess returns. This oversimplified tree exhibits a high forecast error. Next, it computes a second shallow tree, fitting the forecast residuals from the first tree. The forecasts from these two trees are then added together to form an ensemble prediction. The forecast component from the second tree is shrunk by a factor \(\nu \in \left( {0,1} \right)\) to avoid overfitting the forecast residuals. Each additional shallow tree is fitted to the forecast residuals from the preceding ensemble prediction, and its shrunk forecast component is added to the ensemble forecast. The tuning parameters in this forecast model are the depth of the trees \(L\), the shrinkage weight \(\nu\), and the number of trees \(B\) added to the ensemble prediction.
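The sketch below maps the tuning parameters of both tree-based models to the corresponding scikit-learn arguments; the specific parameter values and synthetic data are placeholders, not the tuned values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(960, 341)), rng.normal(scale=0.05, size=960)

# Random forest: depth L -> max_depth, predictors considered per split M -> max_features,
# number of trees B -> n_estimators (all values below are placeholders).
rf = RandomForestRegressor(max_depth=4, max_features=20, n_estimators=300,
                           random_state=0).fit(X_train, y_train)

# Gradient boosting: depth L -> max_depth, shrinkage nu -> learning_rate,
# number of trees B -> n_estimators.
gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.05, n_estimators=500,
                                 random_state=0).fit(X_train, y_train)
```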

Neural networks

Neural networks (nn) are the most complex method in our empirical analysis. They are highly parameterized, which makes them suitable for solving very complicated machine learning problems. But they are opaque and can be difficult to interpret. In general, they map inputs (predictors) to outputs (realized excess returns). Inspired by the functioning of the human brain, they are composed of many interconnected computational units, called “neurons”. Each neuron on its own provides very little predictive power, but a network of multiple neurons functions cohesively and improves the predictive performance. We use feedforward neural networks, where each node is connected to all the nodes in the previous layer and the connections follow a one-way direction, from input to output layer. The input layer contains the predictor variables (e.g., lagged firm characteristics), while the output layer contains a prediction for the dependent variable (realized excess returns). The simplest neural network (without any hidden layer) equals the OLS regression model. Adding hidden layers leads from shallow to deep architectures and is able to capture interactions and nonlinear effects (see Appendix 2, Fig. 6, Panel A).

Neural networks predict the output \(y\) as the weighted average of inputs \(x\). In the simplest model, the OLS regression coefficients are taken as weights. In more complex architectures, the weights must be computed iteratively by using the “backpropagation” algorithm. As an initialization, this algorithm assigns random weights to each connection. It also calculates the initial MSFE based on the predictions derived from the inputs of the (last) hidden layer. It then proceeds iteratively as follows: First, it recursively (from output to input layer) computes the gradient of the MSFE with regard to the weights. Second, it adjusts the weights slightly in the opposite direction of the computed gradients, because the objective is to minimize the MSFE. Third, based on the adjusted weights, it re-calculates the MSFE.

The iteration process, known as “gradient descent,” stops when the MSFE is ultimately minimized. Thus far, it is assumed that each node in the hidden layer creates a signal (i.e., it is incorporated into the computation of the weighted average). In the human brain, however, neural networks work somewhat differently. To avoid noise, a node transforms the preceding signals before it transmits them (if at all). For example, it may amplify or condense the preceding signals, or only create a signal if the accumulated information is sufficient. At each node, the weighted average of the preceding signals (either from the input or the preceding layer) is used as input \(x\) in an activation function (see Appendix 2, Fig. 6, Panel B). Following Gu et al. (2020), we choose the rectified linear unit (ReLU) activation function and apply it to each node in the hidden layers. To encourage sparsity in the number of active neurons, it only provides a signal as an output if the information from the preceding layer accumulates beyond a threshold: \({\text{ReLU}}\left( x \right) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\;x < 0} \hfill \\ x \hfill & {\text{ otherwise}}. \hfill \\ \end{array} } \right.\)

In our analysis, we consider neural networks with up to five hidden layers (HLs) and up to thirty-two neurons (\(N\)), which we choose according to the geometric pyramid rule (Masters 1993).Footnote 10 In addition to a ReLU activation and a lasso-based penalization of the weights, we simultaneously apply other types of regularization to ensure computational feasibility and avoid overfitting (Gu et al. 2020).

First, we use the stochastic gradient descent (SGD) approach to train the neural networks. During the iteration process, the algorithm cuts the training sample into small random subsamples, so-called batches, and uses one at each iteration. This leads to strong improvements in computational speed. The algorithm still sees the entire training sample (consecutively, not contemporaneously, and at least once but usually multiple times), which helps incorporate all available information and avoids impairing the predictive performance. Consequently, the number of iterations depends on the size of the batches and the number of epochs (i.e., the number of times the algorithm sees the entire training sample).

Second, we adopt the batch normalization algorithm introduced by Ioffe and Szegedy (2015). It mitigates the internal covariate shift that occurs as the distribution of each hidden layer’s inputs changes during the training (as the parameters of the preceding layers change) and slows down the learning process. To this end, within each batch, it cross-sectionally normalizes the input to each hidden layer.

Third, we apply learning rate shrinkage (see Appendix 2, Fig. 6, Panel C). The learning rate determines the size of the incremental steps in the gradient, while iteratively minimizing the MSFE. It faces a trade-off between finding the global minimum instead of the local counterpart (smaller learning rate) and computational speed (larger learning rate). This regularization procedure begins with a larger learning rate to speed up computation. As the gradient approaches zero, it shrinks the learning rate towards zero to overcome a potential local minimum.

Fourth, we implement early stopping (see Appendix 2, Fig. 6, Panel D), as neural networks aim to minimize the MSFE in the training sample. This regularization terminates the SGD iteration process when the MSFE in the validation sample increases for a pre-specified number of iterations (so-called patience), which also speeds up computation.

Fifth, we adopt the ensemble approach proposed by Hansen and Salamon (1990) and Dietterich (2000). We compute ten neural networks from the same specification family at each re-estimation date, using independent seeds.Footnote 11 We then average over the predictions to increase the signal-to-noise ratio, since the stochastic nature of the SGD approach leads to different forecasts for different seeds.

Lastly, in addition to the regularization applied by Gu et al. (2020), we employ the dropout method (similar to that applied for regression trees). It randomly sets a fraction rate of input variables to exactly zero at each iteration, and thus is one of the most effective methods in the neural network framework to prevent overfitting.
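The following Keras sketch combines the regularization devices described above (ReLU, a lasso penalty on the weights, batch normalization, SGD with learning-rate shrinkage, early stopping, input dropout, and an ensemble over ten seeds) in an nn_1-style network with one hidden layer. The layer width, penalty strength, optimizer settings, and synthetic data are assumptions for illustration, not the authors' exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(960, 341)), rng.normal(size=(240, 341))
y_train, y_val = rng.normal(scale=0.05, size=960), rng.normal(scale=0.05, size=240)

def build_nn(n_features, width=32, l1=1e-5, dropout=0.1, seed=0):
    """One hidden ReLU layer with batch-normalized inputs, a lasso penalty
    on the weights, and dropout applied to the input variables."""
    keras.utils.set_random_seed(seed)
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dropout(dropout),                 # randomly zero out input variables
        layers.BatchNormalization(),             # normalize the inputs to the hidden layer
        layers.Dense(width, activation="relu",
                     kernel_regularizer=regularizers.l1(l1)),
        layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")
    return model

callbacks = [
    keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),   # early stopping
    keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),              # learning-rate shrinkage
]

# Ensemble over ten seeds; averaging the forecasts raises the signal-to-noise ratio.
preds = []
for seed in range(10):
    nn = build_nn(X_train.shape[1], seed=seed)
    nn.fit(X_train, y_train, validation_data=(X_val, y_val),
           batch_size=256, epochs=100, callbacks=callbacks, verbose=0)
    preds.append(nn.predict(X_val, verbose=0).ravel())
y_hat_val = np.mean(preds, axis=0)
```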

Neural networks are computationally intensive and can be specified in an innumerable number of different architectures. This is why we refrain from tuning these parameters (e.g., the size of batches or the number of epochs) and instead pre-specify five different models. We assume that our nn_1 to nn_5 architectures serve as a conservative lower bound for the predictive performance of neural network models in general.

Empirical results

In-sample tests

Model complexity

Since we re-estimate each forecast model on an annual basis, it is interesting to gauge whether model complexity changes over time or rather remains stable. We only consider models that can actually exhibit time-varying model complexity in our setting (elanet, pcr, pls, rf, and gbrt). These models pertain to different families, so there is no uniform model complexity measure. For the elanet approach, which performs variable selection/shrinkage, we use the number of nonzero regression coefficients. The pcr and pls approaches conduct dimension reduction, so we consider the optimal number of components included in the predictive regression. But, because the rf and gbrt approaches are nonparametric and tree-based, we take the optimal depth (for rf) and the number of unique split variables (for gbrt) of the trees to measure model complexity.Footnote 12

Figure 1 presents the model complexity of each forecast model at each re-estimation date. In line with the results from Gu et al. (2020), we observe that each model’s complexity varies substantially over time. The time-varying behavior of different approaches within the same forecast model family (namely pcr/pls and rf/gbrt) is similar with regard to low- and high-complexity periods.

Fig. 1

Between-models comparison of model complexity over time. This figure presents the model complexity of selected machine learning models (elanet, pcr, pls, rf, and gbrt) introduced in the “Forecast models” section, at each re-estimation date. As the models pertain to different families, the model complexity measure varies: the number of nonzero regression coefficients (for elanet), the optimal number of components included in the predictive regression (for pcr and pls), and the optimal depth (for rf) or the number of unique split variables (for gbrt) of the regression trees. The sample includes all firms that were publicly listed in one of the nineteen Eurozone countries in any given month during the January 1990–December 2020 sample period, while the first estimates of expected excess returns are obtained in December 1999. The data coming from Thomson Reuters Datastream are collected on a monthly basis and, if currency-related, denominated in Euro. Market data are assumed to become public immediately, while fundamental data are assumed to be known four months after the fiscal year-end

Variable importance

Next, because the different forecast models behave similarly in terms of model complexity (at least within each model family), it is further instructive to explore whether the different approaches identify different predictors as most relevant for estimating subsequent excess returns. To this end, we calculate each model’s variable importance matrix based on a two-step approach, separately for each re-estimation date: First, we compute the absolute variable importance as the reduction in \(R^{2}\) from setting all values of a given predictor to zero within the training sample.Footnote 13 Second, we normalize the absolute variable importance measures to sum to 1, signaling the relative contribution of each variable to a model. Figure 2 depicts the time series average of relative variable importance measures for each forecast model separately. We find that, on average, the various forecast models identify similar variables as the most informative (e.g., earningstoprice or ret_2_12). However, some models focus on a highly concentrated set of predictors. For example, the tree-based models (rf and gbrt) put most of their weight on only a few variables, whereas the dimension-reducing approaches (pcr and pls) consider a much larger range of predictors to be important.
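A minimal sketch of this two-step measure is shown below; it assumes the fitted model is not refit after a predictor is zeroed out, which is one common convention rather than necessarily the authors' exact procedure.

```python
import numpy as np

def relative_variable_importance(model, X_train, y_train):
    """Step 1: reduction in training R^2 from zeroing out each predictor;
    step 2: normalize the reductions so that they sum to one."""
    def r2(X):
        resid = y_train - model.predict(X)
        return 1.0 - np.sum(resid ** 2) / np.sum((y_train - y_train.mean()) ** 2)

    base_r2 = r2(X_train)
    drops = np.empty(X_train.shape[1])
    for j in range(X_train.shape[1]):
        X_zero = X_train.copy()
        X_zero[:, j] = 0.0                       # knock out predictor j in the training sample
        drops[j] = max(base_r2 - r2(X_zero), 0.0)
    return drops / drops.sum()

# e.g., relative_variable_importance(rf, X_train, y_train) for the random forest sketched above
```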

Fig. 2

Model-specific relative variable importance. This figure depicts the time series average of relative variable importance measures of each machine learning model introduced in the “Forecast models” section, which are calculated based on a two-step approach: First, the absolute variable importance is computed as the reduction in R2 from setting all values of a given predictor to zero within the training sample. Second, the absolute variable importance measures are normalized to sum to 1, signaling the relative contribution of each variable to a model. The sample includes all firms that were publicly listed in one of the nineteen Eurozone countries in any given month during the January 1990–December 2020 sample period, while the first estimates of expected excess returns are obtained in December 1999. The data coming from Thomson Reuters Datastream are collected on a monthly basis and, if currency-related, denominated in Euro. Market data are assumed to become public immediately, while fundamental data are assumed to be known four months after the fiscal year-end

To enhance between-models comparability, we rank the average relative contribution of each variable within a specific model and sum the ranks across all models to obtain an overall rank (higher variable importance = higher rank). Figure 3 presents a heat map of relative variable importance ranks. The rows are sorted in descending order based on overall rank. Therefore, the higher a variable is placed, the more important it is overall. Darker cell colors denote greater importance of the respective variable to a model. Again, we find commonalities across the relative variable importance metrics, which are even more pronounced within the same forecast model family. For example, both tree-based methods (rf and gbrt) find issues_1_12 strongly informative, while all other forecast models deem this particular predictor relatively unimportant.

Fig. 3

Between-models comparison of relative variable importance. This figure presents a heat map of average relative variable importance ranks of each machine learning model introduced in the “Forecast models” section, which are obtained by ranking the relative contribution of each variable within a specific model, and summing the ranks across all models to obtain an overall rank (higher variable importance = higher rank). The rows are sorted in descending order based on overall rank. Darker cell colors denote greater importance for the respective variable to a model. The relative variable importance metrics are calculated based on a two-step approach: First, the absolute variable importance is computed as the reduction in R2 from setting all values of a given predictor to zero within the training sample. Second, the absolute variable importance measures are normalized to sum to 1, signaling the relative contribution of each variable to a model. The sample includes all firms that were publicly listed in one of the nineteen Eurozone countries in any given month during the January 1990–December 2020 sample period, while the first estimates of expected excess returns are obtained in December 1999. The data coming from Thomson Reuters Datastream are collected on a monthly basis and, if currency-related, denominated in Euro. Market data are assumed to become public immediately, while fundamental data are assumed to be known four months after the fiscal year-end

Finally, we determine whether the relative importance of different variables within a specific model changes over time or rather remains stable. Volatile metrics would indicate that all covariates in the predictor sets are essential; stable figures would suggest removing uninformative predictors permanently, as they can degrade a forecast’s signal-to-noise ratio. To identify the predictors’ time variability in relative importance measures, we rank the relative contribution of each variable within a specific model at each re-estimation date (higher variable importance = higher rank). The trends are similar for all forecast models. Therefore, for the sake of brevity, we only present and discuss the results for our benchmark model (ols_pars). Figure 4 shows the relative variable importance ranks of the ols_pars model at each re-estimation date. Although a few variables tend to stay close to the upper or lower bound, the graph indicates that the relative variable importance metrics change substantially over time. It does not recommend removing specific predictors.Footnote 14

Fig. 4

Relative variable importance over time. This figure shows the relative variable importance ranks of an exemplary selected machine learning model (ols_pars) introduced in the “Forecast models” section, at each re-estimation date, which are obtained by ranking the relative contribution of each variable within a specific model (higher variable importance = higher rank). The relative variable importance metrics are calculated based on a two-step approach: First, the absolute variable importance is computed as the reduction in R2 from setting all values of a given predictor to zero within the training sample. Second, the absolute variable importance measures are normalized to sum to 1, signaling the relative contribution of each variable to a model. The sample includes all firms that were publicly listed in one of the nineteen Eurozone countries in any given month during the January 1990–December 2020 sample period, while the first estimates of expected excess returns are obtained in December 1999. The data coming from Thomson Reuters Datastream are collected on a monthly basis and, if currency-related, denominated in Euro. Market data are assumed to become public immediately, while fundamental data are assumed to be known four months after the fiscal year-end

Out-of-sample tests

Statistical predictive performance

Since the forecast models behave similarly from a model complexity and relative variable importance perspective, we aim to determine whether they differ in statistical predictive performance. We therefore conduct three tests: First, in Panel A of Table 2, we follow Lewellen (2015) and Drobetz et al. (2019) and calculate out-of-sample predictive slopes. These allow us to assess the precision and accuracy of the excess return predictions (without reference to any naïve benchmark). Second, in Panel B of Table 2, we follow Gu et al. (2020) and calculate out-of-sample predictive \(R^{2}\) metrics relative to a naïve excess return forecast of zero. We compute the metrics in Table 2 for each model and three subsamples: the full sample, and two subsamples that only contain large or small firms (based on market capitalization). Third, in Table 3, we assess the relative predictive performance of each model in a pairwise comparison. We conduct a modified version of the Diebold and Mariano (1995) test (DM test) to gauge the differences in out-of-sample predictive power between two models.

Table 2 Predictive slopes and predictive R2s
Table 3 Between-model comparison of predictive performance

Panel A of Table 2 presents the time series averages of out-of-sample predictive slopes (\({\text{PS}}_{{{\text{oos}}}}\)) calculated from the test sample at each re-estimation date. These are derived from pooled regressions of realized excess returns on the corresponding estimates from a specific model, i.e., \({\text{PS}}_{{{\text{oos}}}} = \frac{{{\text{Cov}}\left( {R_{{{\text{real}}}} ,R_{{{\text{pred}}}} } \right)}}{{{\text{Var}}\left( {R_{{{\text{pred}}}} } \right)}}\). The predictive slopes are close to 1 for the machine learning models, indicating that forecast dispersion primarily reflects cross-sectional variation in true expected excess returns.Footnote 15 They are much smaller for the OLS-based models, particularly for the full predictor set, which reflects the overfitting that can arise because it contains too many predictors with a low signal-to-noise ratio.

Importantly, the predictive slopes for the neural network models deteriorate quickly as the number of hidden layers increases, which is in line with Gu et al.’s (2020) findings and might be due to overfitting. Deeper neural networks seem to be too complex for the relatively small dataset and parsimonious set of predictors, and/or the monthly excess return setting. Since this pattern holds for other measures of statistical and economic predictive performance, we only present and discuss the results for the simplest nn_1 architecture going forward.

Considering the full sample, the predictive slopes are closest to 1 for the tree-based models (1.03 and 1.07 for rf and gbrt) and the neural network architecture (0.98 for nn_1), followed by the dimension-reducing models (0.74 and 0.76 for pcr and pls). Note that selecting predictors manually (ols_pars with 0.65) is inferior to using a rule-based selection/shrinkage technique (elanet with 0.92), while incorporating the full set of 341 predictors performs the worst (0.45 for ols_full). Although the patterns are similar for the two market capitalization-based subsamples, the magnitudes of the predictive slopes are substantially higher for the “small firms” subsample. There are two likely reasons: First, the models are estimated from the full sample, so differences between the subsamples in the cross-sectional variation of true expected excess returns could explain differences in predictive slopes. Second, the machine learning algorithms may indeed forecast the excess returns of small firms better.

Panel B of Table 2 presents the time series averages of predictive \(R^{2}\)s (\(R_{{{\text{oos}}}}^{2}\)) calculated from the test sample at each re-estimation date. These are derived by comparing a model’s sum of squared forecast errors with that of the naïve excess return forecast of zero, i.e., \(R_{{{\text{oos}}}}^{2} = 1 - \frac{{{\text{SSE}}\left( {R_{{{\text{real}}}} ,R_{{{\text{pred}}}} } \right)}}{{{\text{SSE}}\left( {R_{{{\text{real}}}} ,0} \right)}}\).Footnote 16 Considering the full sample, our results are in line with Gu et al. (2020) and reinforce our prior findings. In particular, the ols_full model exhibits the worst predictive power, with a negative \(R_{{{\text{oos}}}}^{2}\) metric (− 0.52%), while the ols_pars model avoids overfitting, resulting in a substantially larger value (0.57%). Each of the remaining machine learning models beats the ols_pars benchmark by a wide margin in terms of higher \(R_{{{\text{oos}}}}^{2}\) metrics. Restricting the complexity of the ols_full model by variable selection/shrinkage (1.21% for elanet) or dimension reduction (1.24% and 1.08% for pcr and pls, respectively), or adding interactions and nonlinearity to the ols_pars model with a tree-based approach (1.18% and 1.14% for rf and gbrt, respectively) or a neural network architecture (1.23% for nn_1), boosts the predictive performance. Again, the patterns remain qualitatively the same for the two market capitalization-based subsamples, with substantially higher \(R_{{{\text{oos}}}}^{2}\) metrics for the “small firms” subsample.
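A corresponding sketch for the out-of-sample \(R^{2}\) relative to the naïve zero forecast is given below; again, the inputs are hypothetical arrays of pooled test-sample observations.

```python
import numpy as np

def r2_oos(realized, predicted):
    """Out-of-sample R2 relative to a naive excess-return forecast of zero:
    R2_oos = 1 - SSE(R_real, R_pred) / SSE(R_real, 0)."""
    realized = np.asarray(realized, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse_model = np.sum((realized - predicted) ** 2)
    sse_naive = np.sum(realized ** 2)  # squared errors of the zero forecast
    return 1.0 - sse_model / sse_naive
```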

Table 3 presents DM test statistics to enable a pairwise comparison of predictive performance. We again follow Gu et al. (2020) and apply a modified version of the DM test to compare the forecast models’ monthly MSFEs, i.e., \({\text{MSFE}}_{t + 1} = \frac{1}{{N_{t + 1} }}\sum\nolimits_{i = 1}^{{N_{t + 1} }} {(\hat{e}_{i,t + 1} )^{2} }\).Footnote 17 The DM test statistic for comparing column model \(j\) and row model \(k\) is \({\text{DM}}_{kj} = \frac{{\overline{d}_{kj} }}{{\hat{\sigma }_{{\overline{d}_{kj} }} }}\), where \(d_{kj,t + 1} = {\text{MSFE}}_{t + 1}^{\left( k \right)} - {\text{MSFE}}_{t + 1}^{\left( j \right)}\) is the difference in MSFEs, \(\overline{d}_{kj} = \frac{1}{T}\sum\nolimits_{t = 1}^{T} {d_{kj,t + 1} }\) is the time series average of these differences, and \(\hat{\sigma }_{{\overline{d}_{kj} }}\) is the Newey and West (1987) standard error of \(\overline{d}_{kj}\) (with four lags to account for possible heteroscedasticity and autocorrelation). We follow the convention that positive signs of \({\text{DM}}_{kj}\) indicate superior predictive performance of column model \(j\), i.e., that it yields, on average, lower forecast errors than row model \(k\). As DM test statistics are asymptotically \(N\left( {0,1} \right)\)-distributed and test the null hypothesis that the difference in predictive accuracy between two forecast models is zero, they map to p-values in the same way as regression t-statistics.
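A minimal sketch of this modified DM test is shown below. It assumes that, for each model, the stock-level forecast errors are available as one array per month; the Newey–West adjustment uses four lags as in the text, and the helper names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def dm_statistic(errors_k, errors_j, lags=4):
    """Modified Diebold-Mariano statistic comparing row model k and column
    model j. errors_k, errors_j: lists of 1-d arrays holding the stock-level
    forecast errors of each month's test cross-section."""
    # Monthly cross-sectional MSFEs for each model
    msfe_k = np.array([np.mean(e ** 2) for e in errors_k])
    msfe_j = np.array([np.mean(e ** 2) for e in errors_j])
    d = msfe_k - msfe_j  # monthly loss differentials
    # Regress d on a constant; the t-statistic of the constant with
    # Newey-West (HAC) standard errors is the DM statistic.
    res = sm.OLS(d, np.ones_like(d)).fit(cov_type="HAC", cov_kwds={"maxlags": lags})
    return res.tvalues[0]  # asymptotically N(0, 1) under the null
```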

We interpret the DM test statistics derived from the pairwise comparison in two different ways: First, we consider them separately, without taking into account that other pairwise comparisons are conducted simultaneously. For a significance level of 5%, the threshold for the test statistics to indicate significance is 1.96. We denote single-comparison significance with boldface. Second, we address the multiple comparison issue by using the Bonferroni correction, which divides the 5% significance level by the number of simultaneously conducted comparisons. Comparing eight models, this adjustment increases the significance threshold to 2.50. We denote multiple-comparison significance with an asterisk.

The results underline the tendencies found so far. Each machine learning model beats the OLS framework, and this holds for both the single- and the multiple-comparison case. One exception is the nn_1 model, whose test statistic relative to the ols_pars benchmark (1.55) falls short of significance. Moreover, since the remaining statistics are insignificant, we cannot draw further conclusions about the relative ranking of the other forecast models.

Expected return-sorted portfolios

The various models exhibit similar behavior with regard to model complexity and relative VI, but have substantially different out-of-sample predictive power. Note that the tests conducted in the “Statistical predictive performance” section are statistical in nature, while Leitch and Tanner (1991) suggest that there may be only a weak association between statistical measures and economic profitability. Therefore, it is helpful to understand whether differences in statistical predictive performance translate into differences in predictive power from the economic perspective of a realistic trading strategy. At the end of each month, we first sort stocks into decile portfolios based on the respective excess return estimates. For each model and decile portfolio, we then calculate the equal- and value-weighted mean of ex ante predicted and ex post realized excess returns. Based on the realized excess returns, we also compute the standard deviation, annualized Sharpe ratio, and the t-statistic testing the null hypothesis that realized excess returns are zero. Table 4 presents the time series averages of monthly figures.
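The sorting step can be illustrated with the following sketch, which assumes a long-format panel with hypothetical column names (month, ret_pred, ret_excess, mktcap); it is meant as an illustration of the procedure rather than the paper's actual implementation.

```python
import numpy as np
import pandas as pd

def decile_portfolio_returns(panel, pred_col="ret_pred", real_col="ret_excess",
                             weight_col="mktcap"):
    """Sort stocks into deciles on predicted excess returns at each month-end
    and compute equal- and value-weighted realized excess returns per decile."""
    panel = panel.copy()
    panel["decile"] = (panel.groupby("month")[pred_col]
                            .transform(lambda x: pd.qcut(x, 10, labels=False)) + 1)

    ew = panel.groupby(["month", "decile"])[real_col].mean()
    vw = panel.groupby(["month", "decile"]).apply(
        lambda g: np.average(g[real_col], weights=g[weight_col]))

    # Time-series averages per decile, as reported in Table 4
    return ew.groupby(level="decile").mean(), vw.groupby(level="decile").mean()
```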

Table 4 Excess returns for expected return-sorted portfolios

In general, the patterns are similar for the equal- and value-weighted decile portfolios, although they are substantially more pronounced for the equal-weighting scheme (in terms of higher Sharpe ratios and t-statistics). The average realized excess returns line up almost monotonically with average predicted excess returns, resulting in positive H–L spreads that are statistically significant and economically large. The H–L returns are, on average, more than 75% higher for the equal-weighted decile portfolios, indicating that a substantial portion of the excess returns is driven by small firms. However, all H–L spreads, using both the equal-weighting and the value-weighting scheme, suggest economic profitability. The results highlight that all presented approaches capture cross-sectional variation in realized excess returns, but the potential for providing profitable trading signals differs substantially across them.

We compare the equal-weighted H–L returns, which tend to be driven more by the long side (decile 10) than the short side (decile 1). The values indicate that differences in statistical predictive performance translate at least partly into differences in economic profitability. Again, the ols_full model performs the worst, the ols_pars model performs slightly better, and the remaining forecast models outperform substantially. The realized excess returns of the decile portfolios based on the nn_1 architecture are the most promising, ranging from − 1.25% to 2.01% per month and resulting in a monthly H–L spread of 3.26%. This implies a return ratio, i.e., the ratio of the realized to the predicted average H–L spread, of 1.04 (= 3.26% \(\div\) 3.13%), which is almost identical to the time series average of the predictive slopes (see Table 2, Panel A). The annualized Sharpe ratio is 3.89, with a corresponding t-statistic of 13.96.
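For reference, the annualized Sharpe ratio and the t-statistic of a monthly H–L return series can be obtained along the following lines; this simple version uses plain standard errors without a HAC adjustment and is only meant to illustrate the reported metrics.

```python
import numpy as np

def hl_spread_stats(hl_returns):
    """Annualized Sharpe ratio and t-statistic for a monthly H-L return series
    (null hypothesis: the mean excess return is zero)."""
    hl = np.asarray(hl_returns, dtype=float)
    mean, std, n = hl.mean(), hl.std(ddof=1), hl.size
    sharpe_annual = np.sqrt(12) * mean / std
    t_stat = mean / (std / np.sqrt(n))
    return sharpe_annual, t_stat
```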

Given that higher average returns could simply reflect higher systematic risk, it is necessary to examine risk-adjusted performance. We calculate alphas of the decile portfolios relative to the CAPM and the Fama and French (2015) five-factor model (FF5M), which we augment with Carhart’s (1997) momentum factor. As the results are similar but less conservative for the unconditional version, in which betas are assumed to be time-constant, we only present conditional alphas for the sake of brevity. Beginning-of-month aggregate dividend yields are used to capture time variation in betas (Avramov and Chordia 2006; Ferson and Schadt 1996; Shanken 1990). Table 5 presents the time series averages of monthly figures.
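One way to estimate such conditional alphas is sketched below: factor betas are allowed to vary linearly with the demeaned, lagged aggregate dividend yield, in the spirit of Ferson and Schadt (1996). The HAC adjustment, the demeaning of the instrument, and the input names are assumptions for illustration, not the paper's exact specification.

```python
import pandas as pd
import statsmodels.api as sm

def conditional_alpha(port_excess, factors, div_yield_lag, lags=4):
    """Conditional alpha with betas that vary linearly in the lagged aggregate
    dividend yield.
    port_excess:   Series of monthly portfolio excess returns
    factors:       DataFrame of factor returns (e.g., MKT, SMB, HML, RMW, CMA, MOM)
    div_yield_lag: Series of beginning-of-month aggregate dividend yields
    (all inputs aligned on the same monthly index)."""
    z = div_yield_lag - div_yield_lag.mean()                  # demeaned instrument
    scaled = factors.multiply(z, axis=0).add_suffix("_x_dy")  # interaction terms
    X = sm.add_constant(pd.concat([factors, scaled], axis=1))
    res = sm.OLS(port_excess, X).fit(cov_type="HAC", cov_kwds={"maxlags": lags})
    return res.params["const"], res  # conditional alpha and full results
```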

Table 5 Conditional alphas for expected return-sorted portfolios

Overall, the risk-adjusted values are very close to their raw counterparts, suggesting that the positive average H–L returns are not driven mainly by higher systematic risk. Additionally, all the patterns observable in Table 4 emerge in Table 5 as well. The performance is better for the equal-weighting scheme, the alphas line up almost monotonically with the average predicted excess returns (regardless of the underlying systematic risk model), and the results hold for all forecast models.

Investment portfolio performance

Since the decile portfolio sorts suggest that some forecast models are more promising than others, we next explore whether these models provide superior value-added to investors under realistic trading assumptions. We visually summarize the cumulative performance for long- and short-only investment strategies based on each forecast model, taking the cumulative market excess return as a benchmark. We derive cumulative performance by applying a rule-based approach. At the end of each month, it selects the top decile stocks with the highest expected excess returns, or the bottom decile stocks with the lowest expected excess returns.Footnote 18 We present the results for both weighting schemes. Figure 5 depicts the cumulative performance of investments of €1 in each strategy at the beginning of January 2000.
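For illustration, the cumulative performance paths can be obtained by compounding each strategy's monthly returns, starting from €1 in January 2000; the sketch below uses hypothetical inputs and, like the figure, ignores transaction costs.

```python
import pandas as pd

def cumulative_performance(monthly_returns, start_value=1.0):
    """Cumulative value of start_value (here EUR 1) invested at the beginning
    of January 2000 in a strategy with the given monthly returns."""
    return start_value * (1.0 + pd.Series(monthly_returns)).cumprod()

# Illustrative use: long-only strategy that each month holds the equal-weighted
# top-decile stocks with the highest expected excess returns.
# wealth_path = cumulative_performance(top_decile_monthly_returns)
```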

Fig. 5

Cumulative performance of expected return-sorted portfolios with monthly restructuring. This figure depicts the performance (logarithmic scale) of the market portfolio and each forecast portfolio introduced in the “Forecast models” section. Portfolio values are scaled to €1 at the beginning of January 2000 and presented for long- and short-only portfolios, for both the equal- and value-weighting scheme, and gross of transaction costs. The calculation of transaction costs and the choice of applied transaction costs are explained in the “Investment portfolio performance” section. The sample includes all firms that were publicly listed in one of the nineteen Eurozone countries in any given month during the January 1990–December 2020 sample period, while the first estimates of expected excess returns are obtained in December 1999. The data, obtained from Thomson Reuters Datastream, are collected on a monthly basis and, if currency-related, denominated in Euro. Market data are assumed to become public immediately, while fundamental data are assumed to be known four months after the fiscal year-end.

The results again reflect our previous findings and are qualitatively comparable for both weighting schemes. All investment strategies are capable of dissecting the market universe into high- and low-performing stocks. However, that ability varies substantially among the forecast models and leads to considerable performance discrepancies. Again, we observe that the ols_full model performs the worst, the nn_1 model performs the best, and the remaining models lie in between.

As the portfolio performance tends to be driven by the long side rather than the short side (see Tables 4 and 5), and due to widespread short-selling restrictions in the industry, we now focus on a long-only version of our investment strategy.Footnote 19 We compute return and risk figures for the market portfolio and each forecast portfolio. We also take into account transaction costs when measuring performance. In line with Bollerslev et al. (2018) and Koijen et al. (2018), we compute the portfolio’s transaction costs (PTC) by incorporating the portfolio’s one-sided turnover (PTO) and half-spread (PHS) at the end of each month \(t\).Footnote 20
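Since the exact PTC definition is given in the paper's Footnote 20, the following is only one plausible implementation: one-sided turnover is combined with the (turnover-weighted) half-spreads of the traded stocks. The weight vectors and half-spreads are hypothetical inputs.

```python
import numpy as np

def portfolio_transaction_costs(weights_prev, weights_new, half_spreads):
    """Approximate monthly transaction costs from one-sided turnover (PTO)
    and the half-spreads (PHS) of the traded stocks."""
    trades = np.abs(np.asarray(weights_new) - np.asarray(weights_prev))
    pto = 0.5 * trades.sum()  # one-sided portfolio turnover
    phs = np.average(half_spreads, weights=trades) if trades.sum() > 0 else 0.0
    return pto * phs          # monthly cost drag PTC
```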

Table 6 reports the return and risk metrics for each portfolio, both before and after transaction costs: the terminal value at the end of December 2020, the annualized return and volatility, the maximum drawdown, the annualized Sharpe ratio, and the annualized information ratio (relative to the market portfolio). It also shows actual and applied transaction costs.

Table 6 Cumulative performance of long-only forecast portfolios with monthly restructuring

Independent of the weighting scheme, all forecast portfolios substantially outperform the market. However, our main objective is to examine whether interactions and nonlinear effects add incremental predictive power to a parsimonious forecast model that has already been shown to perform well. We therefore use the ols_pars model as our conservative benchmark. In addition, as we are primarily interested in the forecast models’ ability to select stocks ex ante that tend to perform well ex post, we opt for the naïve equal-weighting scheme in presenting and discussing investment portfolio performance.

Ignoring transaction costs, the ols_pars portfolio yields a terminal value of €70.29. The excess return is substantially higher than that for the market portfolio (22.45% vs. 4.96%), while volatility is only slightly elevated (18.20% vs. 16.21%). This increases the Sharpe ratio (1.24 vs. 0.31). Moreover, the OLS-based forecast portfolio exhibits a lower maximum drawdown (51.24% vs. 60.32%) as well as a high information ratio (3.57). Because the ols_pars model outperforms, it is suitable to serve as our conservative benchmark. However, adding interactions and nonlinear effects to the ols_pars model without any restriction, leading to the ols_full model, weakens the return and risk figures.

Implementing regularization (i.e., variable selection/shrinkage or dimension reduction) to restrict the complexity of the ols_full model, or adding interactions and nonlinearity with a tree-based approach, adds substantial predictive power. The volatilities are slightly higher than that of the ols_pars model (ranging from 18.54% for pls to 19.07% for pcr), and the excess returns range from 23.08% (elanet) to 23.77% (gbrt). The machine learning portfolios sharply increase the terminal value relative to the ols_pars portfolio, with similar figures for maximum drawdown, Sharpe ratio, and information ratio. The nn_1 model performs the best: It raises the terminal value relative to the ols_pars portfolio by €36.89 (or 52.48%) to €107.18 and increases the excess return to 24.93%, with similar figures for standard deviation (17.82%) and maximum drawdown (50.12%). This translates into the highest Sharpe ratio (1.41) and information ratio (4.27).

Next, we account for transaction costs. The average PTO ranges from 27.17% (ols_pars) to 39.54% (ols_full) and is slightly higher for the value-weighting scheme. In contrast, the average PHS is much higher for the equal-weighting scheme (60 bps vs. 20 bps). By definition, the equal-weighting scheme puts larger weights on smaller stocks, which are less liquid and therefore more expensive to trade (in terms of higher bid-ask spreads). As a result, the average PTCs are nearly three times larger for the equal-weighted portfolios.

Although the transaction cost models of Bollerslev et al. (2018) and Koijen et al. (2018) are quite conservative, we follow an even more cautious approach and deduct 57 bps from each portfolio’s monthly excess returns.Footnote 21 The actual transaction costs are many times lower than this deduction, so the applied transaction costs provide a conservative lower bound for the performance of a realistic trading strategy. Overall, each equal-weighted portfolio continues to outperform the market. The performance of the value-weighted portfolios (except for pcr and nn_1) falls slightly below the benchmark, which we attribute to the overly conservative transaction cost deduction.

Classification-based portfolio formation

Our results indicate that interactions and nonlinear effects are important and could be exploited by adequately trained and tuned machine learning models. The forecast models handle the high dimensionality issue with varying degrees of success, resulting in different levels of predictive performance. To evaluate the economic predictive power, we compare the cumulative performance of investment portfolios. They are formed through a two-step procedure that uses forecast models to derive expected return estimates first, and selects stocks with the highest expected return forecasts into long-only portfolios second.

Considering this “detour” through the estimation of stock-level expected returns, we next study whether a single-step portfolio formation process can further improve the predictive performance. We follow the popular learning-to-rank (LTR) classification approach. Instead of incorporating realized excess returns as labels to train a regression model, the stocks from the training and validation sample are cross-sectionally sorted into decile portfolios based on realized excess returns. These sortings are then used as aggregate labels in fitting a rank-based classification model. Top-decile stocks are preferable to bottom-decile stocks, so decile portfolio labels serve as preference ranks.

Such pre-estimation sortings, rather than classifications based on expected return estimates, can help filter out the noise in stock-level excess returns while maintaining the signal that is most important from a trading perspective, i.e., the correct classification into a decile portfolio. The LTR approach computes probability estimates for the affiliation of a stock with each decile portfolio separately and then sums these probabilities into an aggregate probability score, which it uses to classify the stock into a specific decile portfolio.

Machine learning method: Support vector machines

As discussed earlier, support vector machines (SVMs) are a popular classification method. The idea behind SVMs is to search for hyperplanes that partition a multidimensional vector space (our sample of firm observations, consisting of stock-level predictors and decile portfolio labels) into groups of vectors that belong to the same class. Each potential hyperplane is located in an area where vectors of two different classes lie close together. To increase computational speed, SVMs do not always use all vectors from the vector space. Rather, they focus on those in the immediate neighborhood of a potential hyperplane, the so-called support vectors. The algorithm then specifies the optimal hyperplane by aiming to (1) maximize the distance of correctly classified support vectors from the hyperplane, and (2) minimize the number of misclassified support vectors.

In a multi-class scenario, SVMs fit optimal hyperplanes by means of pairwise class comparisons. In theory, by allowing for an unrestricted nonlinear transformation of the vector space and given unlimited computational power and time, the algorithm could avoid any misclassifications. But since this would likely lead to overfitting, quite apart from obvious computational limitations, SVMs must be strongly regularized.

To this end, we use a radial basis function (RBF) kernel for a proper nonlinear transformation of the vector space. We simultaneously apply three types of regularization. First, we constrain the influence of any single vector, i.e., we restrict the space within which it can serve as a support vector. A smaller vector influence \(\gamma\) prevents vectors from serving as supports for overly distant hyperplanes. Second, we set the permitted number of misclassified vectors to a positive value, i.e., we allow for a certain number of misclassifications. Smaller misclassification costs \(c\) allow us to ignore more of the misclassified support vectors, while continuing to fit optimal hyperplanes. To tune both parameters, we select a loss function that meets our objective of classifying stocks into ranked decile portfolios, based on the confusion matrix that contrasts predicted and realized classes. For binary (two-class) classifications, the potential outcomes are true positives (TPs) if the predicted and realized class both equal 1, true negatives (TNs) if both equal 0, false positives (FPs) if the predicted class equals 1 but the realized class equals 0, and false negatives (FNs) if the predicted class equals 0 but the realized class equals 1.

In a two-class scenario, classification performance is usually evaluated using confusion matrices, together with a broad set of classification measures, in particular accuracy, i.e., \(\frac{{\# \left( {{\text{TP}} + {\text{TN}}} \right)}}{{\# \left( {{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}} \right)}}\), sensitivity, i.e., \(\frac{{\# {\text{TP}}}}{{\# \left( {{\text{TP}} + {\text{FN}}} \right)}}\), and specificity, i.e., \(\frac{{\# {\text{TN}}}}{{\# \left( {{\text{TN}} + {\text{FP}}} \right)}}\). Multi-class classification metrics are derived as follows: Separately for each class, the binary one-against-all approach is used to compute hypothetical two-class numbers, treating the class under investigation as 1 and all remaining classes as 0. Aggregate metrics are then calculated as the weighted average of each class’ individual metric, using the number of realized cases within each class as weights. Note that the aggregate accuracy can equivalently be computed as the number of true classifications (cases in the cells on the confusion matrix’s main diagonal) divided by the total number of classifications (cases in any cell of the confusion matrix). To give an example, Table 10 of Appendix 2 provides a visualization of a three-level classification with an inherent rank order (\(3 \succ 2 \succ 1)\). The multi-class accuracy measure in this illustration is \(\frac{10 + 10 + 10}{{100}} = 30\%\) for both scenarios.
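The weighted one-against-all aggregation can be reproduced with a short routine like the one below, which assumes a K×K confusion matrix with realized classes in rows and predicted classes in columns; it is a sketch of the described procedure, not the paper's code.

```python
import numpy as np

def multiclass_metrics(conf):
    """Weighted one-against-all accuracy, sensitivity, and specificity from a
    K x K confusion matrix conf[realized, predicted]."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    accuracy = np.trace(conf) / total   # true classifications / all cases
    weights = conf.sum(axis=1) / total  # share of realized cases per class
    sens, spec = [], []
    for k in range(conf.shape[0]):
        tp = conf[k, k]
        fn = conf[k, :].sum() - tp
        fp = conf[:, k].sum() - tp
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return accuracy, float(np.dot(weights, sens)), float(np.dot(weights, spec))
```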

Traditional classification approaches would aim to maximize this multi-class accuracy. However, this loss function is not applicable to an LTR classification because it does not consider the inherent rank order, i.e., the severity of the misclassifications. Although exhibiting the same 30% accuracy, the predictions in scenario A differ from their realizations more strongly than those in scenario B. To put this in a trading context, selecting a stock that is expected to be in the best decile portfolio, but is actually in the second-best decile portfolio, is much less problematic than accidentally selecting a stock from the worst decile portfolio. This is why the goal of our SVM approach is to minimize an objective function that penalizes the difference between predicted and realized class ranks, i.e., \(\Phi \left( \theta \right) = \sum\nolimits_{i = 1}^{N} {\left| {{\text{pred}}_{i} - {\text{real}}_{i} } \right|}\), where \({\text{pred}}_{i} \in \left\{ {1, \ldots ,10} \right\}\) and \({\text{real}}_{i} \in \left\{ {1, \ldots ,10} \right\}\) are the predicted and realized decile portfolio ranks of stock \(i \in \left\{ {1, \ldots ,N} \right\}\), respectively.
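A minimal implementation of this rank-penalty objective is given below; it could, for instance, serve as the validation criterion in a grid search over the misclassification cost \(c\) and the vector influence \(\gamma\), although the tuning setup suggested here is only an assumed example.

```python
import numpy as np

def rank_penalty(pred_deciles, real_deciles):
    """LTR objective: sum of absolute differences between predicted and
    realized decile ranks, Phi = sum_i |pred_i - real_i| (lower is better)."""
    pred = np.asarray(pred_deciles, dtype=int)
    real = np.asarray(real_deciles, dtype=int)
    return int(np.sum(np.abs(pred - real)))
```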

Multi-class SVMs are computationally intensive. We therefore adapt the ensemble approach from the neural network architecture (see the “Neural networks” section). At each re-estimation date, we randomly split the training sample into ten distinct subsamples and train ten independent SVMs on them. This increases computational speed while still allowing the algorithm to see the entire training sample. We average the individual predictions into a single ensemble prediction, which additionally increases the signal-to-noise ratio.
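The ensemble step might look as follows, sketched with scikit-learn's RBF-kernel SVC under the assumption that each random split contains observations from all ten decile classes (so that the probability columns align across members); the hyperparameter values are placeholders to be tuned as described above.

```python
import numpy as np
from sklearn.svm import SVC

def fit_svm_ensemble(X_train, y_train, c=1.0, gamma="scale", n_members=10, seed=0):
    """Train ten independent RBF-kernel SVMs on ten disjoint random splits of
    the training sample; probability estimates are enabled so that decile
    membership probabilities can be averaged later."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_train))
    models = []
    for chunk in np.array_split(idx, n_members):
        svm = SVC(kernel="rbf", C=c, gamma=gamma, probability=True)
        models.append(svm.fit(X_train[chunk], y_train[chunk]))
    return models

def ensemble_probabilities(models, X):
    """Average the members' class-probability predictions (classes are the
    ten decile portfolios, ordered from decile 1 to decile 10)."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```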

Our SVM approach uses a scoring model to classify stocks into decile portfolios. Based on the disaggregated vector space, it estimates the likelihood that a specific stock will be part of a specific decile portfolio in the following month \(t + 1\). It then aggregates these decile-specific probabilities into a probability score, which is higher for stocks that are ex ante expected to be in decile portfolios with high ex post realized returns, i.e., \(S_{i,t} = \sum\nolimits_{j = 1}^{10} {w_{j} } \times p_{ij,t}\), where \(w_{j} \in \left\{ {1, \ldots ,10} \right\}\) is the rank of decile portfolio \(j\), and \(p_{ij,t}\) \(\in \left( {0,1} \right)\) is the probability that stock \(i\) will be part of decile portfolio \(j\), based on all information available at the end of month \(t\).
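The aggregation into a probability score can then be expressed compactly as follows, assuming the probability columns are ordered from decile 1 (lowest expected returns) to decile 10 (highest expected returns).

```python
import numpy as np

def probability_score(decile_probs):
    """Aggregate score S_i = sum_j w_j * p_ij, where w_j is the decile rank
    (1 = worst, 10 = best) and p_ij the estimated probability that stock i
    ends up in decile j next month.
    decile_probs: array of shape (N_stocks, 10)."""
    ranks = np.arange(1, 11)
    return np.asarray(decile_probs) @ ranks

# Stocks are then re-sorted into forecast deciles on this score; the top decile
# forms the long-only SVM portfolio.
```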

Out-of-sample tests

Statistical predictive performance

Our primary interest is potential performance differences between traditional expected return-based portfolio formation and a classification-based approach. Therefore, we next compare measures of predictive performance. To set the highest possible bar for identifying incremental predictive performance, we use the best-performing machine learning method, i.e., the nn_1 model, as our relevant benchmark. Table 7 shows the confusion matrices, together with a broad set of classification measures.

Table 7 Classification performance of forecast portfolios with monthly restructuring

The entries on the diagonal of the confusion matrix indicate that, at an aggregate level, both models correctly classify stocks into decile portfolios. However, our SVM approach is slightly more accurate than the nn_1 architecture (12.63% vs. 12.10%), as well as marginally more sensitive (12.62% vs. 12.10%) and more specific (90.29% vs. 90.23%). Compared with the 10.04% accuracy of a random classifier, both models are significantly more accurate. They work best for the top and bottom decile portfolios, and the patterns are slightly more pronounced for the short than for the long side. Moreover, the SVM-based predictions differ less strongly from their realizations than those obtained from the neural network. This points towards a superior misclassification distribution, which might translate into outperformance from an economic perspective, because SVMs incorporate the inherent rank order of the decile portfolios explicitly.

Investment portfolio performance

In a final step, recognizing Leitch and Tanner’s (1991) argument that the association between statistical predictive performance (here, classification measures) and economic profitability may be weak, we compare the cumulative performance of long-only investment portfolios. Table 8 shows the return and risk figures from Table 6 for the ols_pars, nn_1, and svm models. The cumulative performance metrics extend our earlier findings. Both the nn_1 and svm models substantially outperform the benchmark model (ols_pars), while exhibiting similar return-risk behavior. Most importantly, the figures for the SVM-based model are slightly superior to those of the neural network for both weighting schemes. We interpret these results to mean that a classification-based approach can outperform even the best-performing expected return-based portfolio formation.

Table 8 Cumulative performance of long-only forecast portfolios with monthly restructuring

Conclusion

Using market and fundamental data for European firms, we study the predictive performance of machine learning methods in forecasting stock returns, including approaches for variable selection/shrinkage or dimension reduction, tree-based models, and neural networks. We conduct a comparative analysis in terms of statistical and economic performance metrics. Following Gu et al. (2020), we augment the set of twenty-two predictors in Drobetz et al. (2019) with two-way interactions as well as second- and third-order polynomial terms to capture nonlinearity. We confirm that interactions and nonlinear effects are important and add incremental predictive power.

Machine learning methods must be adequately trained and tuned to avoid overfitting. Each model’s degree of complexity varies substantially over time, and the different models identify similar predictors as important. Despite these commonalities, the forecast models we study exhibit markedly different statistical predictive performance, with the machine learning models delivering predictive slopes closer to 1, higher predictive \(R^{2}\) metrics, and positive DM test statistics relative to the OLS benchmarks. These differences translate into markedly different economic profitability.

The return and risk figures of long-only investment strategies suggest that all forecast portfolios beat our linear benchmark model. The neural network architecture performs best, also after accounting for transaction costs. Because it follows the “traditional” expected return-based portfolio formation, i.e., estimating stock-level expected returns first and then aggregating stocks into decile portfolios, we compare its performance with a simpler, classification-based approach. We find that a support vector machine classifies stocks into decile portfolios even better than the neural network does. This enables a more straightforward portfolio formation process that avoids some of the noise in stock-level excess returns, while maintaining the signal that is most important for a trading strategy, i.e., the correct classification into a decile portfolio.