Modelling Returns in US Housing Prices—You’re the One for Me, Fat Tails

: In this paper, we analysed the heavy-tailed behaviour in the dynamics of housing-price returns in the United States. We investigated the sources of heavy tails by estimating autoregressive models in which innovations can be subject to GARCH effects and/or non-Gaussianity. Using monthly data from January 1954 to September 2019, the properties of the models were assessed both within- and out-of-sample. We found strong evidence in favour of modelling both GARCH effects and non-Gaussianity. Accounting for these properties improves within-sample performance as well as point and


Introduction
In less than fifteen years, the world has experienced both a global financial crisis and a virus pandemic-both of which have had dire economic consequences. These events have confirmed the fact that large swings in economic variables happen more frequently than what is implied by models based on the traditional assumption of normally distributed disturbances. Put differently, the distributions of many variables are characterised by fat tails (Fagiolo et al. 2008;Ascari et al. 2015), that is, they have a higher probability mass in the tails of the distribution than a normal distribution does.
The purpose of this paper is to study the topic of fat tails related to a key US variable, namely, housing-price returns, an important macro-finance variable. Our analysis focuses on the dynamic properties of the process. We primarily aimed to assess what the appropriate distributional features of the innovations are if one aims to model the dynamics of the returns. We studied this issue using different autoregressive (AR) models, where we allowed for different sources of non-Gaussianity. We first conducted within-sample analysis and then validated our findings through an out-of-sample forecast exercise.
Our analysis was performed by employing monthly data on US housing-price returns from January 1954 to September 2019. We confirmed that the return series was characterised by fat tails and accordingly investigated its sources. We assessed the relevance of timevarying volatility-which can be a source of fat tails in the unconditional distribution of the variable-by estimating GARCH models. Whether error terms are drawn from distributions with fat tails was addressed by abandoning the assumption of a normal distribution for that of a Laplace distribution or a Student-t distribution. Finally, since the unconditional distribution of the data also appeared to be somewhat skewed, we also estimated the model under the assumption that error terms are drawn from a Skew-t distribution (defined as in Hansen 1994). After summarising our results, we found strong in-sample evidence for highly persistent, time-varying volatility and heavier-than-Gaussian

Related Literature
Housing prices are not only important as asset prices, but they are also highly relevant from a macroeconomic perspective since they have a close relation to the real economy. Several studies have documented their connection with the business cycle in general. Iacoviello (2005) used a general equilibrium monetary policy model with nominal loans and housing used as collateral. His main conclusion was that the collateral effect improves aggregate demand response to housing prices. In a similar spirit, Iacoviello and Neri (2010) showed-also using a general equilibrium framework-that housing-market spillovers are non-negligible and that these effects are concentrated on consumption rather than aggregate investment. Klarl (2016) employed wavelet (frequency) analysis to find that housing prices only lead the economic cycle during recessions, and shocks associated with the housing market are relatively short-lived.
The relationship between housing wealth and consumption has received particular attention. Catte et al. (2004) focused on factors of housing-price variability. By estimating error correction models for OECD countries, they found that both macroeconomic and structural factors contribute to housing-price variability and that the propensity to consume depends on how developed the mortgage market is in a given country. Campbell and Cocco (2007) analysed household data from the United Kingdom and found heterogeneity in the propensity of consumption to housing wealth. In particular, older households are more sensitive to housing-price changes and regional housing-price variation is more important than national variation. A similar conclusion was reached by Martín-Legendre et al. (2019) using Spanish household survey data, who also looked at other asset classes and found that the propensity to consume is highest for housing equity. Acolin (2020) found time variation in the relationship between housing prices and consumption based on survey data from the United States. In particular, the propensity to consume increases right before the financial crisis and decreases afterwards. He argues that these changes are associated with the availability of tools to extract wealth from home equity. The importance of wealth extraction is also pointed out by Pan and Wu (2021), who analysed a panel of Chinese homeowners and found that the propensity to consume from housing wealth concentrates on second houses, and not on primary residences.
There are a number of studies that document non-Gaussian behaviour in real estateprice changes. These papers mostly document the presence of skewness and kurtosis in the unconditional distribution of both nominal and real (adjusted for inflation) housingprice returns, which is important from a portfolio risk-management perspective where housing equity is considered as a financial asset; see, for example, Foster and Van Order (1984) who used stable distribution in prices to model volatility and default risk. Myer and Webb (1994) considered both real and nominal returns at a monthly, quarterly and annual frequency. They found that non-Gaussianity is stronger in nominal returns and at a quarterly (rather than monthly or annual) frequency. Young and Graff (1995) applied stable distributions for commercial real estate indices in the United States and found that models with infinite variance (very heavy tails) are better suited than models based on the Gaussian distribution. A similar analysis was performed by Graff et al. (1997) for Australian data and Young et al. (2006) for data from the United Kingdom, with the conclusion that infinite variance models fit the data in the respective countries better. For Germany, Maurer et al. (2004) analysed returns on real estate fund indices and found relatively less kurtosis and skewness than in the United States. Leung et al. (2006) analysed the macroeconomic factors affecting the second moment and skewness of housing returns and found relatively weak support for macroeconomic factors to drive housing-price dispersion and skewness. Pontines (2010) found evidence for fat tails in 16 OECD countries and showed that the Student t-distribution is adequate to fit almost all data. Chiang et al. (2012) argue that value-at-risk based on the Gaussian distribution is inappropriate since the downside risk of housing prices is larger than implied by the Gaussian distribution; it is actually described well by a so-called generalised Pareto distribution for both real and nominal returns in the United States and the United Kingdom.
Our analysis with respect to non-Gaussianity focuses on the dynamic properties of the housing-price process; in that sense, it is related to Dufitinema (2021), who assessed Bayesian stochastic volatility models and found that the model with heavy tails performs the best in terms of out-of-sample forecasts for several Finnish housing-price series. Dynamic interactions were also the focus of McMillan (2012), who considered the long-run relationship between stock prices and housing prices in the Unites States and the United Kingdom with an error correction model. He found that the adjustment to equilibrium is non-linear; large deviations must occur before stock prices revert, while housing prices hardly adjust at all. Musso et al. (2011) estimated a structural VAR for housing prices and business-cycle variables and argue that housing-price shocks are not identified using the standard Cholesky decomposition in a model with Gaussian disturbances. Ahamada and Sanchez (2013) assessed the role of housing-price dynamics in a small-scale macroeconomic VAR model and found evidence that the effect of housing-prices on economic activity intensified during the 1980s.
Housing-price dynamics are also affected by market efficiency and potential nonrational behaviour. Case and Shiller (1989) were pioneers in documenting that when housing markets are inefficient, there is a positive autocorrelation in housing returns and real interest rate information does not seem to be fully incorporated in house prices. Granziera and Kozicki (2015) solved an asset pricing model for housing prices with bubbles created by waves of over-optimism, and they calibrated the model to data in the United States. Their results show that this model captures housing-price dynamics better than the rational expectations model. Analysing household survey data, Ma (2020) found support for non-rational behavior. In particular, households' housing-price expectations tend to exhibit momentum (positive serial correlation), and no significant reversion to fundamentals.
Lastly, our paper contributes to the literature on non-Gaussianity in macroeconomic variables. This issue has received increasing attention in the aftermath of the financial crisis and the recent economic downturn associated with the coronavirus pandemic. Fitting exponential-power distributions, Fagiolo et al. (2008) found that the output-growth in several OECD countries has heavier than normal tails, even after controlling for autocorrelation, heteroscedasticity and outliers. Related to that, Ascari et al. (2015) simulated standard new-Keynesian and real business cycle models, and they suggest that these models may prove to be unable to generate the heavy tails in the unconditional distribution documented in output growth. Skewness-that is, asymmetry-has also been discussed for macroeconomic variables. Neftci (1984) has documented that the unemployment rate is asymmetric by estimating a finite state Markov process. Time series evidence based on aggregate data in Acemoglu and Scott (1997) also suggests that output in the United States is best captured by non-linear, asymmetric models where sharper downturns (a heavier left tail) are allowed for. Lastly, Bekaert and Popov (2019) used panel data from 110 countries to document a positive cross-sectional correlation between the volatility and skewness of output growth.

Data
We based our analysis on Shiller's (2015) housing-price index. In our main analysis, we used nominal housing prices, in line with several previous studies, including Young and Graff (1995); Young et al. (2006); Pontines (2010) and Ma (2020). Nominal returns are also the focus of some important surveys related to housing prices, such as the University of Michigan's Survey of Consumers; for analysis based on this survey, see, for example, Piazzesi and Schneider (2009) and Hjalmarsson and Österholm (2020). However, since real housing returns (adjusted for inflation) are also of interest and have been analysed in the literature, we also consider real returns in Appendix B; 1 both nominal and real returns are considered in, for example, Case and Shiller (1989); Myer and Webb (1994); Maurer et al. (2004); Chiang et al. (2012) and Jordá et al. (2019).
Data are monthly and span the period from January 1953 to September 2019. Nominal returns are calculated as r t = P t −P t−12 P t−12 , where P t is the price index at time t. A time-series plot of the return data is given in Figure 1 and a histogram showing the unconditional distribution is provided in Figure 2. Some key descriptive statistics are given in Table 1.
J. Risk Financial Manag. 2021, 14, x FOR PEER REVIEW 4 of 18 discussed for macroeconomic variables. Neftci (1984) has documented that the unemployment rate is asymmetric by estimating a finite state Markov process. Time series evidence based on aggregate data in Acemoglu and Scott (1997) also suggests that output in the United States is best captured by non-linear, asymmetric models where sharper downturns (a heavier left tail) are allowed for. Lastly, Bekaert and Popov (2019) used panel data from 110 countries to document a positive cross-sectional correlation between the volatility and skewness of output growth.

Data
We based our analysis on Shiller's (2015) housing-price index. In our main analysis, we used nominal housing prices, in line with several previous studies, including Young and Graff (1995); Young et al. (2006); Pontines (2010) and Ma (2020). Nominal returns are also the focus of some important surveys related to housing prices, such as the University of Michigan's Survey of Consumers; for analysis based on this survey, see, for example, Piazzesi and Schneider (2009) and Hjalmarsson and Österholm (2020). However, since real housing returns (adjusted for inflation) are also of interest and have been analysed in the literature, we also consider real returns in Appendix B; 1 both nominal and real returns are considered in, for example, Case and Shiller (1989); Myer and Webb (1994) time-series plot of the return data is given in Figure 1 and a histogram showing the unconditional distribution is provided in Figure 2. Some key descriptive statistics are given in Table 1.     Note: N is the number of observations. The critical value at the five percent level of the Jarque-Bera test is 5.99. The Augmented Dickey-Fuller (ADF) test is performed with four lagged differences in the test equation (to be consistent with the selected lag length for the empirical model considered in Section 5). Both unit-root tests are performed under the assumption that there is no deterministic trend in the data (the test equation has an intercept but no time trend). The critical value at the five percent level for the ADF and KPSS test is −2.866 and 0.463, respectively.
As both the histogram in Figure 2 and the descriptive statistics in Table 1 suggest, the tails of the distribution appear heavier than what is implied by a normal distribution; the kurtosis of the unconditional distribution is around 3.8. There is also a slight negative skewness. Taken together, this means that the Jarque-Bera test strongly rejects an assumption of normality. Furthermore, both the Augmented Dickey-Fuller test and the KPSS test strongly suggest that the return series does not contain a unit root.
In Table 2, we compare the unconditional housing-price returns and a normally distributed variable using the kurtosis ( , ), peakedness ( , ) and tailness ( , ) measures proposed by Liu (2019). These measures are based on the ratio of interquantile ranges of the variables, using various quantiles of the distribution denoted by , , and (for example, = 0.05 means that we use the 5th percentile of the distribution when calculating kurtosis and tailness). 2 Of particular importance are the tails of the return distribution, which are heavier than those of the normal distribution at all levels. There is also some evidence for asymmetry in the unconditional distribution further out in the tails (α = 0.01), but skewness is not a robust feature if we use different quantiles (α). As both the histogram in Figure 2 and the descriptive statistics in Table 1 suggest, the tails of the distribution appear heavier than what is implied by a normal distribution; the kurtosis of the unconditional distribution is around 3.8. There is also a slight negative skewness. Taken together, this means that the Jarque-Bera test strongly rejects an assumption of normality. Furthermore, both the Augmented Dickey-Fuller test and the KPSS test strongly suggest that the return series does not contain a unit root.
In Table 2, we compare the unconditional housing-price returns and a normally distributed variable using the kurtosis (K α,τ ), peakedness (P η,τ ) and tailness (T α,η ) measures proposed by Liu (2019). These measures are based on the ratio of interquantile ranges of the variables, using various quantiles of the distribution denoted by α, η, and τ (for example, α = 0.05 means that we use the 5th percentile of the distribution when calculating kurtosis and tailness). 2 Of particular importance are the tails of the return distribution, which are heavier than those of the normal distribution at all levels. There is also some evidence for asymmetry in the unconditional distribution further out in the tails (α = 0.01), but skewness is not a robust feature if we use different quantiles (α).

Methodological Framework
Since the focus of our analysis was to model the properties of the housing-return series, we used univariate time series to capture the dynamics of the returns. We used AR models because they typically perform well in macroeconomic forecasting (e.g., Faust and Wright 2009), despite their simplicity and parsimony. Our baseline model is the homoscedastic univariate AR(p) model: where ν t is assumed to be an iid normal error term, ν t ∼ N(0, 1). The number of lags is determined using the Schwarz (1978) information criterion, which suggests a lag length of p = 5; for comparability we employed the same number of lags across all model specifications.
We then allowed for a more flexible error term in the model. First, we let the second moment vary over time. Doing that, we choose a robust approach and rely on the GARCH(1,1) specification (Bollerslev 1986), with a normally distributed error term. This is a common modelling choice-see Chung et al. (2012) and Clark and Ravazzolo (2015) among others for macroeconomic variables-and for financial data, a GARCH(1,1) model is fairly difficult to outperform (Hansen and Lunde 2005). Hence, the second model was the AR(p)-GARCH(1,1): where v t , as above, is assumed to be an iid, normal error term. In the third and fourth models, we relax the assumption of normality of the error term, while maintaining symmetry. This is done by using ν t ∼ Laplace and ν t ∼ Student − t(κ), where κ is the degrees of freedom. Of specific interest here is the fact that both the Laplace distribution and the Student-t distribution allow for heavier-than-Gaussian tails. 3 Finally, in the fifth model, we assume that the error terms are drawn from the Skew-t distribution (Hansen 1994 where κ is the degrees of freedom and λ is the skewness parameter. This means that we allowed for both heavy tails and asymmetry in the error term. 4 Our primary focus was to evaluate which distributional properties are important for modelling the dynamics of the housing-price return series. We evaluated this based on both within-sample estimates (using maximum likelihood) and an out-of-sample forecasting exercise; in the latter, we considered both point and density forecasts. The point forecasts were evaluated based on the root mean square error (RMSE) of the forecast, accompanied by the Diebold-Mariano test (Diebold and Mariano 1995). We formulated the test such that a positive test value implies that the given model outperforms the benchmark. 5 In terms of density forecast evaluation, we made use of the probability integral transform (PIT) proposed by Diebold et al. (1998). We looked at the histograms of the PIT and visually assessed how close they are to the uniform distribution. We also employed more rigorous statistical techniques such as the Kullbach-Leibler (KL) divergence (Kullback and Leibler 1951) of the PIT series from an iid uniform distribution and formal statistical tests of uniformity, namely, the Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) test (Anderson and Darling 1952). 6 For all these measures, a lower value implies that the given model produces a better density forecast.

Empirical Analysis
We start by presenting within-sample estimation results for all five specifications. 7,8 After that, we turn to an out-of-sample forecast exercise. In the forecast exercise, we focus on the one-period-ahead horizon. The main reason for this is that results for that horizon theoretically are most closely related to within-sample findings and can hence be considered a validation tool. Table 3 presents the estimation results for the five specifications. Note that the housingprice return series is fairly persistent (the sum of the autoregressive coefficients are close to unity). Engle's (1982) ARCH test shows the strong presence of conditional heteroscedasticity for the residuals in the baseline model in column 1. Also, the Jarque-Bera test suggests that residuals are non-Gaussian in this specification. This is further supported by Figure 3, where it is clear that a t-distribution fits the residuals of the baseline specification better than a normal distribution.  (3), (4) and (5) employ a GARCH(1,1) specification together with Laplace, t-, and Skew-t-distributed error terms, respectively. The ARCH-test is Engle's test for conditional heteroscedasticity (Engle 1982) conducted with twelve lags. The JB-test is the Jarque-Bera test for conditional heteroscedasticity. " a " indicates that the test is based on the probability integral transform of the (standardised) residuals (see footnote 10). p-values are in parentheses (); standard errors for the degrees of freedom of the t-distribution are in brackets []. "N" is the number of observations.

Within-Sample Estimation Results
The GARCH(1,1) specification (column 2) seems to take care of conditional heteroscedasticity quite well. However, the variance is very persistent and its parameters are not precisely estimated. 9 For the specification with Laplace distribution, where no extra parameter is added to capture heavy-tails, the persistence of the variance is somewhat lower. However, it seems that the Laplace distribution is not flexible enough to capture the behaviour of the standardised residuals, as their transformation by imposing a normal distribution still shows signs of non-Gaussianity according to the Jarque-Bera test. 10 In contrast, the more flexible approaches, the t and the Skew-t distribution, seem appropriate to capture non-Gaussianity. However, the skewness parameter is low and non-significant (column 5). specification together with Laplace, t-, and Skew-t-distributed error terms, respectively. The ARCH-test is Engle's test for conditional heteroscedasticity (Engle 1982) conducted with twelve lags. The JB-test is the Jarque-Bera test for conditional heteroscedasticity. " a " indicates that the test is based on the probability integral transform of the (standardised) residuals (see footnote 10). pvalues are in parentheses (); standard errors for the degrees of freedom of the t-distribution are in brackets []. "N" is the number of observations. The GARCH(1,1) specification (column 2) seems to take care of conditional heteroscedasticity quite well. However, the variance is very persistent and its parameters are not precisely estimated. 9 For the specification with Laplace distribution, where no extra parameter is added to capture heavy-tails, the persistence of the variance is somewhat lower. However, it seems that the Laplace distribution is not flexible enough to capture the behaviour of the standardised residuals, as their transformation by imposing a normal distribution still shows signs of non-Gaussianity according to the Jarque-Bera test. 10 In contrast, the more flexible approaches, the t and the Skew-t distribution, seem appropriate to capture non-Gaussianity. However, the skewness parameter is low and non-significant (column 5). Figure 4 shows the estimated conditional volatility of the return series. All GARCH specifications look quite similar; the biggest discrepancy is found for the Laplace, which due to its excessively heavy tails allows for a generally lower variance. Volatility also clusters; the series starts out with a higher general level of volatility, is considerably lower between 1980 and 2005, and peaks again during the financial crisis.
Having assessed the different specifications within-sample, we next conducted an out-of-sample forecast exercise to validate our results. If we can establish with this exercise that a more complex specification also generates better out-of-sample forecasts, we can conclude that our results are not due to over-fitting the data within-sample.  Figure 4 shows the estimated conditional volatility of the return series. All GARCH specifications look quite similar; the biggest discrepancy is found for the Laplace, which due to its excessively heavy tails allows for a generally lower variance. Volatility also clusters; the series starts out with a higher general level of volatility, is considerably lower between 1980 and 2005, and peaks again during the financial crisis.

Out-of-Sample Analysis
We evaluated the one-period-ahead forecasting performance of the models using an expanding window technique starting in January 1974; this yielded F = 548 forecasts to evaluate. 11 The results are collected in Table 4 for both point and density forecast measures.  Having assessed the different specifications within-sample, we next conducted an out-of-sample forecast exercise to validate our results. If we can establish with this exercise that a more complex specification also generates better out-of-sample forecasts, we can conclude that our results are not due to over-fitting the data within-sample.

Out-of-Sample Analysis
We evaluated the one-period-ahead forecasting performance of the models using an expanding window technique starting in January 1974; this yielded F = 548 forecasts to evaluate. 11 The results are collected in Table 4 for both point and density forecast measures. For point forecasts, the homoscedastic model clearly performs worst. Accounting for time-varying volatility statistically significantly improves the forecast, as can be seen from the Diebold-Mariano test. However, allowing for non-Gaussian error terms does not provide additional improvement in point forecasts over the Gaussian GARCH specification.
For density forecasts, the homoscedastic benchmark model again appears to have the weakest performance. Not only are all three evaluation measures largest for this model, the KS and AD tests reject uniformity of the PIT series at the five percent level. Comparing the results in Table 4, allowing for GARCH effects (coupled with normally distributed error terms) improves quite a lot on the density forecasts such that the formal tests no longer reject the null of uniformity. Additional improvements are generated by allowing for heavier tails in the error distribution using a t-distributed error term. Further generalising the error distribution to a Skew-t distribution appears to be less important, although all three density forecast measures are lowest in case of the specification with skewed error terms. Using the Laplace distribution to capture non-Gaussianity seems to produce the least favourable results (with the KS and AD tests rejecting uniformity), which is consistent with the within-sample findings.
To gain further insight into the out-of-sample results, the histograms of the PIT for each specification are presented in Figure A1 in Appendix A. The graph of the homoscedastic model (top left panel in Figure A1) is peaked, implying that the homoscedastic, Gaussian model's predictive density is too wide. 12 Allowing for GARCH effects-while still assuming a Gaussian error term-eliminates this problem and produces PITs whose distribution is fairly close to uniform (top right panel in Figure A1). The Laplace distribution is less successful at capturing the tail observations (second row, left panel in Figure A1), while the Student-t, and Skew-t distributed error terms appear most successful in bringing PITs close to uniformity (bottom left and right panels in Figure A1, respectively).

Discussion of the Results
Our aim with the current study was to assess whether non-Gaussianity in housing returns is important when it comes to modelling the dynamics of the series (which is also related to point and density forecasting). Our results clearly point to the conclusion that non-Gaussianity does matter. The in-sample evidence shows that we need to use distributions that are different from the Gaussian to capture the behaviour of the innovations; the model with conditionally heteroscedastic, but Gaussian error terms has standardised innovations that are non-normal, as evidenced by the Jarque-Bera test. Furthermore, we found a need for flexible modelling of the distribution: either a t-distribution or a Skew-t distribution (both of which parametrise the degree of non-Gaussianity) can capture the shape of the innovation distributions appropriately.
Point forecasts are mostly unaffected by flexible modelling of innovations. This is unsurprising since these additional features introduce more parameters (and hence potentially lower precision) without any changes in the structure of the mean equation. In contrast, the best density forecasts are produced by the models with tor skew-t innovations (although GARCH effects seem to be most relevant for density forecasts). This underlines the importance of non-Gaussianity in situations where the forecaster is interested not only in the first moment, but the entire predictive distribution. A prime example is fan charts used in monetary policy decision making and communication; see, for example, . Another example is "growth-at-risk", which is a framework for providing statements concerning the probability of poor outcomes for GDP growth; see, for example, Adrian et al. (2018); Adrian et al. (2019) and Prasad et al. (2019). Growth-at-risk is getting increasing attention from institutions such as the IMF and central banks.
Our results further strengthen the message of the literature (Myer and Webb 1994;Maurer et al. 2004;Young et al. 2006 andChiang et al. 2012 among others) that non-Gaussianity is a relevant feature of housing returns. These papers, analysing the unconditional distribution of the variables, provide a very important foundation for our analysis, to which we make additions. By modelling the dynamics and allowing for heteroscedasticity, we established whether unconditional non-Gaussianity is caused by conditional heteroscedasticity alone or if innovations are also non-Gaussian. After finding evidence for the latter, we concluded that allowing for time varying second moments is not sufficient; one should also allow for non-Gaussian innovations. 13 The importance of non-Gaussian innovations for out-of-sample forecasting is a finding in line with Dufitinema (2021). However, in this paper we go further, as we document the importance of non-Gaussianity for density forecasts as well.

Conclusions
Fat tails in housing-price returns have been established in several studies. We contribute to this literature by analysing heavy-tailed behaviour in the context of modelling and forecasting the return series. We found evidence of both conditional heteroscedasticity and non-Gaussian behaviour. Non-Gaussianity mostly takes the form of excess kurtosis, while the support for skewed behaviour is weaker. We also found that accounting for these features helps to improve both point and density forecasts.
Our results highlight the importance of underlying distributional assumptions when modelling housing-price returns. In addition, our findings point to the relevance of considering time-varying volatility when modelling macroeconomic time series; after all, housing-price returns are not only a financial variable but also an important macroeconomic variable. While this aspect of macroeconomic and macro-finance modelling has grown in popularity over the last fifteen years-after having been made popular by  and Primiceri (2005)-we nevertheless believe that it has not been given sufficient attention.
Author Contributions: The authors contributed equally to the manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
There is no conflict of interest to disclose.

Appendix A
J. Risk Financial Manag. 2021, 14, x FOR PEER REVIEW 12 of 18 Figure A1. Histograms of probability integral transform for out-of-sample forecasts. Note: Relative frequency on vertical axis.  Figure A1. Histograms of probability integral transform for out-of-sample forecasts. Note: Relative frequency on vertical axis.  (3), (4) and (5) employ a GARCH(1,1) specification together with Laplace, t-, and Skew-t-distributed error terms, respectively. The ARCH-test is Engle's test for conditional heteroscedasticity (Engle 1982) conducted with twelve lags. The JB-test is the Jarque-Bera test for conditional heteroscedasticity. " a " indicates that the test is based on the probability integral transform of the (standardised) residuals (see footnote 10). p-values are in parentheses (); standard errors for the degrees of freedom of the t-distribution are in brackets []."N" is the number of observations. Appendix B    (3), (4) and (5) employ a GARCH(1,1) specification together with Laplace, t-, and Skew-t-distributed error terms, respectively. The ARCH-test is Engle's test for conditional heteroscedasticity (Engle 1982) conducted with twelve lags. The JB-test is the Jarque-Bera test for conditional heteroscedasticity. " a " indicates that the test is based on the probability integral transform of the (standardised) residuals (see footnote 10). p-values are in parentheses (); standard errors for the degrees of freedom of the t-distribution are in brackets []. "N" is the number of observations.  Figure A2. Histograms of probability integral transform for out-of-sample forecasts-real returns. Note: Relative frequency on vertical axis.

1
The real prices are Shiller's (2015); the CPI has been used as the deflator. As can be seen when comparing our main results to those in Appendix B, the results using real returns are qualitatively similar. Most importantly, we found that the Student-t GARCH and Skew-t GARCH specifications are the only ones whose residuals pass the Jarque-Bera test for normality. These two models also have the best out-of-sample forecast performance. Figure A2. Histograms of probability integral transform for out-of-sample forecasts-real returns. Note: Relative frequency on vertical axis.

1
The real prices are Shiller's (2015); the CPI has been used as the deflator. As can be seen when comparing our main results to those in Appendix B, the results using real returns are qualitatively similar. Most importantly, we found that the Student-t GARCH and Skew-t GARCH specifications are the only ones whose residuals pass the Jarque-Bera test for normality. These two models also have the best out-of-sample forecast performance.

2
The measures are related to each other and the quantiles of the distribution by the formula where Q i is the i-th quantile of the distribution and the quantiles used satisfy 0 < α < η < τ < 0.5. Further, the tail (T) can be decomposed to right (RT) and left (LT) tails according to Fagiolo et al. (2008) found the Laplace distribution useful when modelling the fat tails of GDP growth rates. The Student-t distribution has been used more widely in the empirical literature; see, for example, Cúrdia et al. (2014); Clark and Ravazzolo (2015); Cross and Poon (2016); and Kiss and Österholm (2020). 4 All distributions are parameterised to have a unit variance. The density function of the Skew-t distribution can be found in Hansen (1994). 5 As pointed out by Diebold (2015), the Diebold and Mariano (1995) test in its standard form is a reasonable choice even if we employ nested models, for which the original assumptions of the test do not formally hold. This is further supported by the fact that the test performs relatively well in such larger training and evaluation samples as the one we use in our analysis (Clark and McCracken 2013). 6 The KL divergence has been used in a number of applications in the context of density forecasts-typically though with a somewhat different focus; see, for example, ; Robertson et al. (2005); Hall and Mitchell (2007); Diks et al. (2010) and Mitchell and Wallis (2011).
7 Figure 1 suggests that the time-series behaviour of the return series may have changed around 1975. In fact, the Shiller (2015) dataset changes source in January 1975. Therefore, we also estimate the model on the shorter subsample, namely, January 1975 to September 2019. The results collected in Table A1 in Appendix A are qualitatively very similar to full sample estimates. 8 We also assessed the robustness of our results to the presence of different regimes, in particular the boom and bust cycle between 2002 and 2009. We did this by allowing for a different constant and dynamics in the mean equation during this period, using a time dummy and interactions. Unreported results (available upon request from the authors) show very similar results to our baseline. Capturing non-Gaussianity in the innovations remains a salient and important feature of the model. 9 In fact, the parameters α 1 and α 2 sum up to unity for the normal-GARCH, Student-t-GARCH and Skew-t-GARCH specifications. However, looking at the results using the Laplace distribution and the shorter sample, integrated volatility does not seem to be a robust feature of the data, therefore we do not impose it in any of the specifications with conditional heteroscedasticity.

10
The Jarque-Bera tests are based on z t , the PIT series of the standardised residuals, for which Φ −1 (z t ) is standard normally distributed (where Φ −1 is the inverse cumulative distribution function of the standard normal distribution). We test this by applying the Jarque-Bera test on Φ −1 (z t ).

11
That is, we first estimate the models on the sample January 1953 to January 1974 and make predictions for February 1974. We then expand the sample to January 1953 to February 1974, re-estimate the models and predict March 1974. We continue in this manner until we reach the end of the sample, where we estimate the models using data from January 1953 to August 2019 and make predictions for September 2019.