Scenario-Based Stress Tests: Are They Painful Enough?

Forecasts, models and stress tests are important tools for policymakers and business planners. Recent developments in these related spheres have seen greater emphasis placed on stress tests from a regulatory perspective, while at the same time forecasting performance has been criticized. Given the interlinkages between the two, similar limitations apply to stress tests as to forecasts, and should be borne in mind by practitioners. In addition, the recent evolution of stress tests, and in particular the increasing popularity of scenario-based approaches, raises concerns about how well the shortcomings of the associated models are understood. One such shortcoming is the limited degree of pain, measured by estimated stress cases relative to base cases, that simple scenario modelling approaches engender. This paper illustrates this phenomenon using simulation techniques and demonstrates that more extreme stress scenarios need to be employed in order to match the inference from simple value-at-risk approaches. Alternatively, more complex modelling approaches can address this concern, but they are not widely used to date. Some policymakers seem to be aware of these issues, judging by the severity of some recent stress scenarios.

As such, the degree of stress imposed by scenario-based approaches may be relatively less punitive, or painful, than that imposed by simple historical approaches when both are conducted on the same basis. This may be one reason why regulatory authorities have focused on extreme scenarios to mitigate the issue.
The remainder of the paper is structured as follows. Section 2 provides some broad background on the linkages between models, forecasts and stress tests.
Section 3 then outlines the evolving role of stress testing in the banking sector, including Basel-related regulatory requirements. Section 4 briefly details and compares alternative stress testing methods from previous research. Section 5 presents new simulations that compare and contrast scenario-based and value-at-risk stress testing approaches. Finally, Section 6 concludes.

Models, forecasts and stress tests: their uses and misuses
Economists make a living by analyzing prices, quantities or some other form of economic or financial variable. One type of tool often used for such analysis is a model. This can take many different forms: a model can be a set of algebraic equations, for instance, describing how different factors such as demand and inflation interact. For econometricians, models have taken increasingly complex statistical forms, both to establish where they obviously fail (instances commonly referred to as errors or residuals) and exactly how to interpret those failures. But more generally, models can be diagrams, spreadsheets or complex interlinked calculations: there are no broad restrictions on the form that different models can take.
However, all economic and financial models, whether qualitative, statistical or algebraic, share one key feature: they are necessarily simplifications of how the real world works (Bank of England, 2003).
The financial and economic linkages between companies, banks and governments, both within and across countries, are so complex that no model can completely (or arguably even adequately) capture them. Trying to do so typically results in a model that does not work in some fashion, for instance, because it is too complex to solve or because it fails to match perceptions of prior experience. More generally, simply including more data, or adding more complexity, is no guarantee of improving model or forecasting performance, although in certain circumstances it can prove very powerful (Bernanke, Boivin, & Eliasz, 2005; Mumtaz, Zabczyk, & Ellis, 2009). Pagan (2003) provides a useful summary of the linkages between models and forecasting.
Instead, it is important to recognize the true role of models: they are essentially useful tools for consistently processing information. That consistency in turn makes statistically based models in particular very useful for large-scale simulations, where multiple Monte Carlo or bootstrap simulations are used to test and infer the properties of the model (Ellis, 2006). However, the real world is rarely completely consistent, so, even armed with these tools, analytical judgement will always be critical both in building and using models. Models are useful analytical tools but do not mechanically give 'the answer'.
Forecasting is one area where analytical judgement is key. Economists, meteorologists and other forecasters cannot rely solely on mechanical processes but need to apply their own insights and judgement (Coletti, Hunt, Rose, & Tetlow, 1996). Because models are necessarily simplifications of the real world, and the related identification of shocks is inexact, particularly in real time given underlying data uncertainty issues (Castle & Ellis, 2002; Croushore & Stark, 2001), they will often miss specific issues or linkages, or potentially overstate them. Here again judgement is required in attempting to 'fill the gaps' in the model. It can often seem like these interventions make things worse, rather than better. Economists' forecasting record both during and since the financial crisis has been less than stellar. There remains considerable 'clustering', where point forecasts for data series from different forecasters tend to bunch together over time, and forecasters often assume that key macroeconomic variables such as growth and inflation revert to trend too quickly (Pain, Lewis, Dang, Jin, & Richardson, 2014), although views about trend may be more disparate now than prior to the crisis. Faced with this poor performance, it can be tempting to disregard forecasts altogether, and there is evidence that the general public's expectations are often far removed from those of policymakers (Moessner, Zhu, & Ellis, 2011). A forecast should reflect the forecaster's best guess of what will happen over the coming months or years, but that forecast will almost certainly be precisely wrong, even if it is broadly right. The fundamental reason for this is that shocks hit the economy, or indeed a company's cash flow, all the time, and shocks, as the name suggests, are by definition technically unpredictable, being random in terms of both incidence and magnitude.
Shocks are no more predictable than exactly which lottery numbers will emerge in each draw. The best a forecast can do is express how past shocks will affect a business or sector over time.
If shocks make point forecasts inaccurate, what is the point of forecasting? Forecasts are important because they help economists identify (future) shocks as they materialize; they also help to refine our understanding of economic and financial relationships and transmission mechanisms. If GDP growth is forecast to be 3% next year, and it comes in at 4%, the forecast was wrong. Either the forecaster will have misjudged the impact of past shocks, or new shocks have hit the economy. Distinguishing between these two outcomes is difficult, but starting from 3% at least quantifies the 'news' in the growth data, providing a gauge of how wrong the forecaster was. This has parallels with the time series modelling used by Blanchard & Fisher (1989) and implemented for instance by Flood & Lowe (1995) in an inventory modelling context to distinguish between expected and unexpected changes in demand.
This quantification is a critical step, in part, because it informs how future forecasts may need to be adjusted or refined. Without being clear about what was expected -the benchmark or counterfactual -we cannot begin to identify and quantify new or unexpected developments. Uncertainty will always be inherent in forecasts, but these same forecasts are still critical for any meaningful analysis of events.
Partly because of this inherent uncertainty, some forecasters deliberately choose to present their views as a range of different outcomes rather than simply presenting point forecasts for key data series. This can include distinct scenarios (for instance, poor, central or good scenarios) or even illustrating the range of uncertainty around forecasts by plotting probability-based distributions of outturns. The Bank of England was a forerunner here with its famous 'fan charts', as described in Britton, Fisher and Whitley (1998). These explicit distributions can be useful insofar as they tell us more about the assumed balance of risks, or forecasters' views about the uncertainty around their forecasts. Having a distribution of possible future outcomes, including the mode and the mean, is clearly more informative than simply providing point forecasts. However, ultimately even fan charts will be wrong and fail to match the actual distribution of data over time. For instance, prior to the financial crisis, the Bank of England was criticized by Wallis (2004) for publishing fan charts that were too broad. Yet the crisis exposed that the fan charts were actually not wide enough (see Bank of England, 2009).

However, if point forecasts (best guesses) are nearly always precisely wrong in terms of predicting what will happen, then there is no reason to think that stress tests, which describe literally unexpected outturns, will be any better as a guide to tail events. This is again because of the unpredictable nature and transmission of shocks, as noted above, which may be even more pronounced for tail events. As such, an odd juxtaposition has arisen between the increased reliance on stress testing by regulators and reduced confidence in central forecasts. In part, this may reflect the changing nature of many of the stress tests and models used, as discussed in the next section.
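The intuition behind fan charts can be illustrated with a toy simulation (this is a sketch of the general idea, not the Bank of England's actual method): simulate many shock paths from a simple model and read off percentile bands at each horizon. The AR(1) model and all parameter values here are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative AR(1) 'economy': x_t = 0.8 * x_{t-1} + shock
rho, sigma, horizon, n_paths = 0.8, 1.0, 8, 10_000

paths = np.zeros((n_paths, horizon + 1))  # every path starts from x_0 = 0
for t in range(1, horizon + 1):
    paths[:, t] = rho * paths[:, t - 1] + rng.normal(0, sigma, n_paths)

# Fan-chart-style percentile bands of the simulated distribution by horizon
bands = {p: np.percentile(paths, p, axis=0) for p in (5, 25, 50, 75, 95)}

# The 5-95 band 'fans out' as the horizon lengthens, because shocks cumulate
print([round(bands[95][t] - bands[5][t], 2) for t in (1, 4, 8)])
```

Plotting the bands against the horizon reproduces the familiar fan shape: a central band around the median with progressively wider outer bands.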
Stress tests: an evolving regulatory landscape for banks

While these and other developments represent important changes in the regulatory framework for banking, they are beyond the scope of this paper. Instead, the focus of this paper is limited to comparing and contrasting two particular methods of stress testing that are frequently applied in the banking sector. In this context, it is most useful to limit the initial discussion of stress testing to credit risk, as that aspect is one of the most directly applicable in terms of comparing and contrasting different risk assessment approaches. However, the differences discussed herein hold significant relevance for the broader stress testing sphere, rather than being applicable only to credit risks or indeed just to banks. As such, while the approach adopted herein is deliberately simplistic, it has potentially far-reaching implications for stress testing more generally.

Typically, under the original Basel II framework, banks were required to hold capital to cover unexpected losses associated with a 99.9% VaR over a one-year horizon; in other words, capital buffers should be high enough to cover all but the most extreme of unexpected losses.
Because this approach specifies the entire distribution of losses, the stressed loss (literally the expected loss plus the further unexpected loss, at a given level of stress) can be expressed as a 'multiple' of the expected loss:

Multiple = Stressed losses / Expected losses    (1)

This idea is developed by Moody's (2014) and serves as a useful gauge of how 'stressful' a given outcome is: the higher the multiple, the higher the stress. When the exposure at default (EAD) is the same in both instances, the multiple is simply the ratio of the stressed and expected loss rates.
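For concreteness, equation (1) can be written out in a few lines of code; the loss rates and EAD figure below are hypothetical numbers chosen only to show the arithmetic.

```python
# Illustrative stress 'multiple' (equation (1)): stressed over expected losses.
def stress_multiple(expected_loss_rate, stressed_loss_rate, ead=1.0):
    """Ratio of stressed to expected losses on a common exposure at default."""
    expected_losses = expected_loss_rate * ead
    stressed_losses = stressed_loss_rate * ead
    return stressed_losses / expected_losses

# With identical EAD, the multiple collapses to the ratio of the loss rates:
print(stress_multiple(0.01, 0.04))        # stressed losses four times expected
print(stress_multiple(0.01, 0.04, 250))   # same multiple whatever the EAD
```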
VaR analysis offers some substantial advantages, such as its practical viability and conceptual attractiveness, as presented by Kupiec (1998) among others from a historical context, and the ability to consider and contrast multiple models and calibrations, as demonstrated by Alexander and Sheedy (2008) in the context of currency pairings. Unfortunately, however, VaR analysis spectacularly failed to act as a useful guide to the losses that banks suffered during the financial crisis. In part, this likely reflected modelling and data limitations, where risks were calculated on insufficient data or on time series that did not incorporate tail events. Figure 1 presents an illustration: based on the time series for US banks' residential mortgage charge-offs prior to the financial crisis, simple distribution assumptions akin to a VaR-based approach would have suggested relatively low loss rates during a tail event. This is a deliberately simplistic example: pre-crisis stress tests were not all based on such simple historical observation, and many stress tests are hypothetical rather than historical. However, it is clear from this example that, with hindsight, these types of backward-looking stress tests were far from exacting and greatly underestimated the damage done to banks' balance sheets by the subprime crisis. Partly in response, regulators have increasingly turned to scenario-based stress tests.

A key advantage of the scenario-based approach is that it offers a compelling 'narrative hook' for non-technicians. In a banking context, the supervisor or the banks themselves can then try to work out what this 'scenario' would imply for credit losses, net interest income, and other relevant determinants of bank solvency. This point was first noted by Lopez (2005) in terms of linking potential losses to a 'specific and concrete' set of events.
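The backward-looking calculation behind the Figure 1 illustration can be mimicked with a toy example. The charge-off numbers below are synthetic stand-ins for a calm pre-crisis period, not the actual US series.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic quarterly charge-off rates (%) drawn from a calm regime
charge_offs = rng.normal(loc=0.2, scale=0.05, size=60).clip(min=0)

# A simple historical VaR-style stress: read a tail percentile straight
# off the observed distribution of past charge-offs.
stressed_rate = np.percentile(charge_offs, 99)

# Calibrated only on calm data, the implied 'stress' barely exceeds the
# sample mean, which is how such tests understated subprime-era losses.
print(round(charge_offs.mean(), 3), round(stressed_rate, 3))
```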
However, the specificity and solidity of said events is far from assured given the forward-looking and ultimately hypothetical nature of scenario-based stress tests; to date, no scenario posited in regulatory stress tests has materialized with the effects implied in the stress test. (As noted earlier, this is not surprising given that the exercise seeks to predict the unexpected.) At the same time, the narrative hook can give non-technicians unwarranted comfort about the stress testing approach if the transmission mechanism between the scenario and stressed outcomes (i.e., the implicit modelling approach) is ignored. In fact, scenario-based stress tests are every bit as subject to the modelling uncertainty that is inherent in forecasts, as described in the previous section of this paper.

Developing alternative stress test approaches: a brief literature review
There have been other criticisms of stress tests based on macroeconomic scenarios; Borio, Drehmann and Tsatsaronis (2012) provide a useful summary, noting that they are not suitable early warning devices and would benefit from complementary information.
Concerns have also been raised about the appropriate modelling framework. In many instances, the models that are used to construct the scenario, or to relate it to banks' credit losses, will often have been built around the center of the distribution (Bunn, Cunningham, & Drehmann, 2005; Hoggarth, Sorensen, & Zicchino, 2005). Typically, we might estimate the relationship between default rates and macroeconomic variables over a number of years and then plug a downside scenario into that model to generate losses.
However, financial and economic relationships, as in other spheres, can be very different in the tails of the distribution than in the middle of it. One particular concern is that using an empirical model to map scenarios onto loss rates, or some other 'stress' variable, will unduly compress the range of outcomes.
Put simply, if the model is geared towards fitting the central tendency of some dependent variable or data series, it may be poor at capturing the tails of the distribution, which is precisely what stress tests try to explore. For this reason, some recent research has focused on more flexible models that allow these relationships to change as we move into the tail of the distribution (see Covas, Rump, & Zakrajsek, 2013). These quantile models, as described by Koenker & Hallock (2001), may offer a better guide to stressed outcomes, but they are not yet widely employed.
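A minimal sketch of the mechanism behind quantile models: rather than squared error, they minimise the 'check' (pinball) loss. In the simplest case of fitting a constant, the minimiser of that loss is the empirical quantile itself, which is why such fits track the tail rather than the central tendency. The loss data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
losses = rng.lognormal(mean=0.0, sigma=0.5, size=5_000)  # skewed 'loss' data

def pinball(q_hat, y, tau):
    """Check (pinball) loss that quantile regression minimises at quantile tau."""
    err = y - q_hat
    return np.mean(np.where(err >= 0, tau * err, (tau - 1) * err))

tau = 0.99  # a '1 in 100' tail, as in the stress tests discussed in the text
grid = np.linspace(losses.min(), losses.max(), 2_000)
best = grid[np.argmin([pinball(g, losses, tau) for g in grid])]

# The minimiser of the check loss sits at the empirical 99th percentile,
# far above the mean of this skewed distribution.
print(round(best, 2), round(np.percentile(losses, 99), 2), round(losses.mean(), 2))
```

Full quantile regression replaces the constant with a linear (or richer) function of the indicators, but the same loss function does the work of targeting the tail.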
However, even these models may still fall short.
Critically, the fit of the model will still play an important role in determining stressed outcomes in such conditional analysis: if the overall fit of the model is poor -even for central tendencies -then we should not expect it to perform well when predicting tail outcomes.

Simulation analysis: where scenarios fall short
To investigate this phenomenon, a variety of simulations were constructed to compare and contrast the performance of simple proxies for scenario-based and VaR-based stress tests. The following simulation results are not meant as sophisticated statistical advancements of either VaR or scenario-based stress-testing approaches; instead, they are deliberately simple in order to clearly illustrate a key issue when gauging differences between the two approaches. The focus in this section is on comparing simple stress tests based on past events relative to out-of-sample scenario-based approaches.
The methodological framework for the simulation testing is deliberately simple in order to illustrate the differences between VaR and scenario-based approaches. A 'reference' series was generated from a simple data-generating process (DGP), such as the autoregressive process in equation (2), with alternative DGPs (equation (3)) also employed as robustness checks. Alongside this series, an 'indicator' variable, Y_t, was generated for the reference series, where the error between the indicator and reference series (equation (4)) was constrained to be Gaussian, and the signal-to-noise ratio was randomly selected. A simple OLS model (equation (5)) relating the reference series to the indicator was then estimated over the in-sample period, and a stressed value of the indicator, corresponding to the chosen tail percentile, was passed through that model to generate a stressed value for the reference series. After resuming the underlying DGP for the reference series for another 25 observations, stress multiples (equation (1)) were constructed for these scenario-based stresses. This is, in essence, a very simple scenario-based stress test.
At the same time, to provide a useful comparison within the same methodological framework, an equivalent backward-looking VaR-based stress was also simulated over the out-of-sample period. In these instances, the stressed reference series were constructed purely from the distribution of the observed reference series over the preceding 125 periods. Importantly, the level of stress was aligned with that for the scenariobased stress; thus, if the stress scenario corresponded to a '1 in 100' event, so too did the simple VaR-based stresses. As before, resulting stress multiples were calculated using equation (1).
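A stripped-down version of this scenario-versus-VaR comparison can be sketched as follows. This is a sketch under stated assumptions, not the paper's code: the AR(1) reference series, the random coefficient between zero and one, and the 125-observation window follow the text, while the signal-to-noise range, the comparison in stressed levels rather than multiples, and the use of `np.polyfit` for the OLS step are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def one_trial(n_in=125, tail=0.99):
    """One simulated comparison of scenario-based vs VaR-based stress."""
    rho = rng.uniform(0, 1)                # random AR(1) coefficient, as in the text
    x = np.zeros(n_in)                     # 'reference' series (e.g. loss rates)
    for t in range(1, n_in):
        x[t] = rho * x[t - 1] + rng.normal()
    noise_sd = rng.uniform(0.1, 2.0)       # assumed range for the noise scale
    y = x + rng.normal(0, noise_sd, n_in)  # noisy 'indicator' series

    # Scenario-based stress: regress x on y by OLS, then push a '1 in 100'
    # indicator value through the fitted model.
    slope, intercept = np.polyfit(y, x, 1)
    scen = intercept + slope * np.quantile(y, tail)

    # VaR-based stress at the same percentile, read off the history of x itself.
    var = np.quantile(x, tail)
    return scen, var

sims = np.array([one_trial() for _ in range(2_000)])
scen, var = sims[:, 0], sims[:, 1]

# The scenario-based stress is typically milder: the OLS fit shrinks the
# indicator's tail towards the mean whenever the noise term has any variance.
print(round((scen < var).mean(), 3))
```

The share printed at the end is the fraction of trials in which the scenario-based stress comes in below the equally calibrated VaR-based stress; under these assumptions it is well above one half.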
These approaches are deliberately simplistic, but they allow a like-for-like comparison, and the VaR-based approach was typically more 'stressful' in terms of generating higher stressed multiples.
Results for the simple autoregressive DGP are presented below where the autoregressive term in (2) was a random draw between zero and one. Figure 3 presents stress multiples for the scenario-based approach on the left-hand side and multiples for the VaR-based approach on the right-hand side. The results are based on 10,000 simulations, stressing to a '1 in 100' event.
The horizontal axis in both charts corresponds to the fit of the indicator model used in the scenario-based stresses. Importantly, both vertical axes have been truncated so that the pattern of main results is not compressed by outliers. Consequently, Figure 4 also summarizes the full distributions of the simulated stressed outcomes.
Two key results emerge from the simulations. First, model fit does appear to influence the degree of stress in scenario-based approaches; in the left-hand panel of Figure 3, the results cluster closer to 1 (indicating no gap between stressed and expected outcomes) when model fit is poor and are higher and more dispersed when model fit is better. However, the impact of model fit is relatively small overall: even models that captured the reference series (i.e., fitted the DGP) relatively well could still result in multiples that are not very different from badly fitting models. Importantly, the higher level and dispersion of stress multiples for better-fitting models in the scenario-based approach was not consistently evident across different DGPs for the reference series.
Second, and more importantly, the degree of 'stress' from the scenario-based approach is typically significantly less than that from the VaR-based approach. The full distribution results presented in Figure 4 confirm this, and formal statistical tests also indicate that the VaR-based approach was more stressful than the scenario-based one. Importantly, these broad results were unaffected even when the underlying DGP for the reference series was changed (for instance, from equation (2) to equation (3)).
This result reflects the 'error' between the indicator and reference series as noted in equation (4).
In the presence of any volatility in this error term, which in this context can be taken as a proxy for the inability of the scenario model to completely explain the variance of the reference series, the distribution of fitted outcomes from the scenario model will, by definition, be narrower than the distribution of the historical reference series. As long as the variance of this noise term is positive, that gap remains. In practice, the best that the scenario-based approach can achieve, where the scenario is based on past values of the indicator series, is to match the performance of the backward-looking VaR approach: this would correspond to the noise series having zero variance, i.e., being a constant that is subsumed by the estimated OLS coefficients. As long as any noise in the indicator series is orthogonal to the reference series, as implied by OLS models, the VaR-based approach will tend to generate more stressful outcomes, on average, than simple scenario-based models.
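This variance argument can be checked numerically: OLS decomposes the variance of the dependent variable into explained and residual parts, so the fitted values are necessarily less dispersed than the series itself, and their tail percentiles are correspondingly less extreme. The series and noise scale below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(7)

# Reference series x and a noisy indicator y, as in the text
x = rng.normal(size=5_000)
y = x + rng.normal(0, 1.0, size=5_000)  # orthogonal noise added to the signal

slope, intercept = np.polyfit(y, x, 1)
fitted = intercept + slope * y

# Fitted values are less dispersed than the series they explain ...
print(round(fitted.var(), 2), round(x.var(), 2))

# ... so their extreme percentiles are milder than the series' own tail.
print(round(np.percentile(fitted, 99), 2), round(np.percentile(x, 99), 2))
```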

To obtain more extreme stressed outcomes from the simple indicator model, more extreme scenarios, that is, scenarios outside the historical experience captured by the indicator series, must be applied. The analogous approach, in a VaR-based stress, would be to calibrate the stress based on some non-observed shift in the series. Such a calibration would be subject to considerable uncertainty, as indeed would the choice of an 'extreme' economic scenario. However, in principle, one approach would be to base it on past shocks seen in other spheres or instances, such as that shown in Figure 1.
A key finding from this analysis is that, to match the implied degree of stress from VaR-based approaches, simple scenario-based stress tests need to employ more extreme scenarios (in this instance, worse than '1 in 100') than those used in the VaR-based approaches.
Importantly, this gap between VaR-based and scenario-based stresses was only overturned when the functional form of the scenario model was changed, in particular when a quantile model was estimated in place of equation (5), with the estimation quantile specified at a percentile similar to that of the stresses. However, the gap between the two approaches was not especially pronounced, as illustrated by Figure 5: the mean (median) extra stress from the quantile scenario approach was approximately 13% (10%). Quantile modelling therefore buys stress testers more challenging outcomes than a simple VaR-based approach, but at the cost of increased complexity.
This analysis indicates that the recent focus on scenario-based approaches does not automatically deliver more stressful outcomes, relative to a VaR-based approach; when both are conducted on a similar basis using historical models, the stress is likely to be lower in the former approach than in the latter.
Perhaps accordingly, some recent research has focused on reviving interest in combining VaR approaches with scenario-driven stress tests, building on Berkowitz (1999). A related comparison can be drawn from published forecast distributions: because the Bank of England presents its forecasts as fan charts (Britton et al., 1998) and also publishes the parameters behind these fan charts, it is possible to (re)construct the entire forecast distribution and focus on different percentiles as desired. On that basis, the 2014 stress scenario appears to sit far out in the tail of the contemporaneous forecast distribution (Figure 6). It is unlikely that policymakers genuinely want to hold banks to such an exacting standard. But, if they are aware that scenario analysis can potentially understate stresses relative to other approaches, policymakers may be incentivized to create highly stressful scenarios. Unfortunately, one longer-term risk is consequently that these scenarios are deemed too incredible and lose public and political support over time, especially if complex modelling approaches are required. As such, one natural mitigant would be to ensure that alternative approaches such as VaR-based stresses are also employed.

Figure 6: Bank of England forecast and stress test in 2014
Source: Adapted from "Stress testing the UK banking system: key elements of the 2014 stress test" by Bank of England (2014, April). Retrieved from http://www.bankofengland.co.uk/financialstability/Documents/fpc/keyelements.pdf
A broader underlying concern is that models based on historical data may not be appropriate when conditions move outside past experience.

Conclusion
Forecasts and models, and increasingly stress tests, are important tools for policymakers and business planners. However, there is a risk that they are not properly understood. Forecasts, for instance, are best thought of as guides to evolving events that let analysts gauge future 'news' in events and data, rather than as completely accurate guides to that future.
Similar limitations apply to stress tests, which by definition try to describe what analysts do not expect to happen.
Furthermore, the recent shift in focus from VaR-based stress tests to scenario-based stress tests raises particular concerns. First, the appealing narrative hook of a scenario-based approach could distract non-technicians from the inherent complexities and shortcomings of the underlying modelling approach, which will necessarily be flawed and incomplete.
Second, some facets that may be thought of as distinguishing scenario-based tests from VaR-based tests, such as being forward-looking or covering multiple periods, do not in fact distinguish the two approaches; VaR-based tests can be both forward-looking and cover multiple periods. Third, as the analysis in this paper has demonstrated, the stresses arising from simple scenario-based approaches will often be less onerous than those from a similarly-calibrated yet simple VaR-based approach when they are both conducted on the same basis. To address this concern, more complex modelling approaches are required, which in turn may not be widely understood. Happily, there are indications that some policymakers are already aware of this issue, judging from the extreme stress scenarios applied in some recent regulatory stress tests.