Applied Machine Learning for Stochastic Local Volatility Calibration

Stochastic volatility models are a popular choice to price and risk–manage financial derivatives on equity and foreign exchange. For the calibration of stochastic local volatility models a crucial step is the estimation of the expectated variance conditional on the realized spot. The spot is given by the model dynamics. Here we suggest to use methods from machine learning to improve the estimation process. We show examples from foreign exchange.


INTRODUCTION
For derivatives pricing a major breakthrough was achieved with the risk-neutral valuation principle (Black and Scholes, 1973). Initially the models assumed a deterministic, state-independent volatility of the underlying price process. For many classes of underlying this assumption is too restrictive as it does not allow for an implied volatility that depends on strike as it is observed in the market, at least since the Black Friday in 1987, see for a review and attempted explanation (Benzoni et al., 2011).
Hence the most natural extension of the existing models was to postulate either a state-dependent volatility often duped as local volatility (Derman and Kani, 1994;Dupire, 1994) or to postulate an additional process for the volatility (e.g., Hull and White, 1987;Heston, 1993) which are labeled as stochastic volatility models.
Looking at the properties of these two model classes it was found (Hagan et al., 2002) that local volatility is postulating a dynamics which is not found in real markets. In foreign exchange options markets stochastic volatility models tend to exaggerate the effect of volatility convexity and at the same time these models are unable to match the short-dated volatility smile observed in market-prices. As a practical workaround, models that mix the local volatility and stochastic volatility were developed (Said, 1999;Blacher, 2001). It was observed that the calibration of SLV models is a hard problem which requires either a specific parametrization to derive fast pricing of vanilla options or quite time-consuming numerical optimization procedures (Guyon and Henry-Labordere, 2011). See as well Homescu (2014) for a great summary and best practice of local stochastic volatility models.
A shortcut to derive manageable calibration times was developed by Guyon and Henry-Labordere (2011) and Van der Stoep et al. (2014) using a Monte Carlo procedure to derive the required estimation of the conditional variance. In this paper we suggest to use methods from machine learning, in particular radial basis functions and variations thereof to derive fast and efficient estimators.

LOCAL STOCHASTIC VOLATILITY CALIBRATION
The LSV model in general is of the form: with spot S t , variance V t , drift µ, (state-dependent) drift for the variance µ V , vol of variance ξ χ(V t ), and correlation ρ. The LSV calibration is the process to determine the leverage function σ given the local volatility function σ 2 Dupire and all the other parameters of the model. There is a fundamental relationship of the leverage function and the local volatility function (Dupire, 1996) where the expectation E P(S t ,V t ,σ ) of the conditional variance V t is taken with respect to the risk-neutral measure induced by the model. The notation indicates that P(S t , V t , σ ) is the joint probability of spot process S t , variance process V t and the solution for σ which depends on the probability distribution of (S t , V t ).
Plugging the solution into the model equation makes this a McKean SDE where the expectation depends on the probability of the process itself.
To solve this equation Monte Carlo simulation can be used. The equations are discretized and the forward propagation of the spot S t and variance V t is interleaved with the estimation of conditional expectation using the realized paths of S t and V t . Contrary to standard Monte Carlo where all paths develop independently we need to bring all simulated paths to the estimation procedure. Using Euler discretization with W, X independent increments. The estimation of the conditional expectation can be seen as finding the function R(S) = E(V t |S t = S) based on the samples as observed pairs where the spot S i t and variance V i t are the time t realizations of spot and variance on path i. Originally it was proposed to estimate the function R using kernel regression (Guyon and Henry-Labordere, 2011): with Kernel functions K h , where we dropped the t index as it is clear from the context. Alternatively (Van der Stoep et al., 2014) proposed to use binning techniques or sets of polynomials.
Subsequently we will evaluate alternative regression techniques to estimate the conditional expectation based on the realized paths. This can be rephrased as a supervised learning problem where each path is a (noisy) example.

REGRESSION AS A SUPERVISED LEARNING PROBLEM
The task to find a relationship between some input variables and an output from examples is one of the problems tackled by machine learning and is well studied as supervised learning. There are many classes of supervised learning algorithms and setups and we would like to demonstrate guidelines to which specific choices are suitable for the problem at hand. The basic problem is, given a set of examples x i , y i to find a function f (x) such that an error functional is minimized. The task is to find a function such that there is low error on unseen examples, this is called generalization. There is a balance to strike between the error on the examples used for training and the error on the validation set of examples which are examples not used during training, for a fundamental analysis of the learning theotry and the relation between capacity and the generalization (see e.g., Vapnik, 2013), in particular chapter 4.

Kernel Regression
The approach taken in Guyon and Henry-Labordere (2011) as stated above is Nadarajan-Watson kernel regression which is one of the so called non-parametric methods. The method is identical to Equation (2). The estimator is given as: In this approach a Kernel function K(x) is used, which satisfies: There is a variety of Kernel functions well studied in the literature (Härdle, 1990): Often the Kernel function used is Gaussian hence the support of the function is infinite or it will be the Epanechnikov Kernel which has bounded support.
The crucial choice is the bandwidth of the Kernel functions. There is a rule-of-thumb derived from normal distribution assumptions, (Silverman, 1986): for the standard deviation σ of the data and n data points. Alternatively cross validation, particular "leave-one-out cross validation" can be used to determine an optimal Kernel width. Cross validation is quite costly computationally and hence can only be used to cross check ad-hoc choices.
Local Linear Kernel Regression is a variation of Kernel regression which employs local linear terms and which is given by the solution of The minimum is found by solving a 2 × 2 linear system .
In general the Kernel approaches suffers from some systematic shortcomings, mainly the fact that all examples are used, no compression happens, secondly a bias is introduced close to the boundary and the difficult choice of suitable bandwidth, where practically sound theoretical methods as cross validation cannot be used for computation time reasons.

Radial Basis Functions
Radial Basis Functions (RBF) and Partition of Unity Radial Basis Functions (PURBF) respectively take the form PURBF are quite similar in functional form to Kernel regression. The main difference is that the number of basis functions is much smaller than the number of examples. It was proven that RBF and PURBF are universal function approximators (Hakala et al., 1994) which makes them suitable to approximate our estimation problem. If the L 2 norm is used the weights are optimized by solving the normal equation The solution s given by the weights w i , which satisfy     Frontiers in Artificial Intelligence | www.frontiersin.org  with the matrix A ij given as The remaining parameters are determined heuristically: • centers c i -in 1D we chose min(x i ), max(x i ), and a random subset of the remaining x i . Centers which are too close (in their local width) are pruned-the criterion for unit j is ≤ with a suitable global pruning constant . There are resource allocating approaches as well (e.g., Fritzke, 1994) which seem less suitable here.
• width h i -are chosen individually as a multiple of the k nearest neighbors or ad-hoc.

Regularization
Often the solution of the normal equation will be ill-conditioned.
To counteract the bad conditioning of the problem and to get a better generalization we will use a regularizer on the L 2 norm of the weights (e.g., Goodfellow et al., 2016, Chapter 7.1).
The corresponding solution is given as with identity matrix id. The same solution applies for the PURBF function instead of the RBF one.

Computational Efficency
For standard Kernel Regression computational effort is mainly due to sorting O(n log(n)) of the spot observations to enable an efficient lookup of relevant spot observations during the retrieval phase. Optimal determination of width (crossvalidation) requires the evolution of all kernels at all points several times which is very costly compared to the lookup. Local linear Kernel Regression requires an additional inversion of a 2 × 2 matrix which is negligible. For RBF and PURBF the solution to a small linear system is required. In particular the size is much smaller than the number of samples. Sorted examples can be used to optimize the training as the required matrix is determined by sums over the samples. Width and pruning computations require local computation of the order of the number of kernel functions. Overall the computational effort for RBF/PURBF is comparable and might be smaller in the retrieval phase than for Kernel Regression itself. Frontiers in Artificial Intelligence | www.frontiersin.org

Alternative Architectures
In the last couple of years popularity of multilayer perceptron (MLP) and deep versions thereof grew enormously. For our application we rule out these architectures as the training is much more involved in the MLP case with a many remaining questions about a suitable number of hidden units, number of layers, type of activation functions. We could envision to use a pretrained MLP to get the solution without training. We postpone this approach for potential future use.

APPLICATION TO LSV MODEL IN FOREIGN EXCHANGE
The model we will study is of Heston type with mean reversion speed κ, mean reversion levelV.
The advantage is that we have a semi-closed form solution for vanilla call-and put options in the Heston model without the leverage function hence the first step is to calibrate the Heston model and then apply a scaling to the vol of vol parameter to reduce the SV impact and to let the local volatility compensate to match the vanilla option market. In this study we will use a volatility mixing of 66% which means that we scale the vol of variance by this factor before calibrating of the leverage function.
To compare the performance of the various regression algorithms on this model we will show for a specific slice the realized spot/variance and the corresponding results of the regression functions.

Example EUR/USD 6M
You can see the volatility surface in Figure 1 and the corresponding local volatility surface in Figure 2. The snapshot of data, including spot, volatility, and interest rates was taken in March 2018. We show the results of different kernel estimators, using Silverman's rule of thumb for the width, including the samples indicated as Current, as well as the forward and the level of a 0.1% digital on the upside and downside Figure 3. In Figure 4 you see the results of Kernel Regression for Silverman's rule and additionally the cross-vaildated width for the same kernel functions. Notice that the optimal width varies between different kernels. We show results for local linear kernel regression in Figure 5. It can be seen that the bias at the boundaries is reduced in comparison to the kernel regression. Again we show results for Silverman's rule and additionally the cross-vaildated width.
For the PURBF we show results in Figure 6 using a global width, relative knn width, pruned and relative knn width, pruned with global width, and pruned, knn width and regularizer (λ = 0.2). We use 40 units in all cases as this number seems sufficently versatile for the number of particles we want to use (2,048).
The last version with regularizer, pruning, and local width is the preferred version as it shows a smooth behavior without a bias in the boundaries and matches the part with many data-points in the middle without oscillations.

Example EUR/USD 5Y
We show the results for 5Y maturity and the same volatility surface in Figure 7. Among the tested approaches the PURBFwith 5 nearest neighbors performs best.

Example USD/JPY 5Y
We show the results for USD/JPY, see the local volatility surface in Figure 8. The estimation across the spot range is shown in Figure 9. Again the PURBFwith 5 nearest neighbors performs best.

Example EUR/BRL 3Y
We show the results for EUR/BRL, which is a highly skewed and highly drifting underlying. See the local volatility surface in Figure 10. The estimation across the spot range is shown in Figure 11. Note that in this case the range of spot realizations is quite skewed as is expected from the skewed volatility surface. Nevertheless the PURBFwith 5 nearest neighbors puts a relatively smooth estimator through the samples and performs better than other methods.

Pricing Examples
To see the impact on exotics pricing we look at one-touch options. A one-touch option pays one unit of the counter currency at the maturity date if the spot trades at or beyond the touch-level at any time during the life of the option. We show the impact as a function of the Black-Scholes price (TV), similar to (Clark, 2011). The TV of a one-touch can be between 0% and the discount factor to maturity, which is in the range of 100%. For fixed market parameters like spot, volatility and the risk-neutral drift TV is a function of the touch-level only, hence makes a unique scale to show the model impact. The deviation of the LSV model price from the TV is the desired effect of an alternative model, which incorporates volatility risk management    and hedging, compared to the Black-Scholes model. The form of the deviation is not obvious and would require a rather complicated hedging argument of volatility risk and cross spotvolatility risk.
With the mixed local stochastic volatility model and mixing rate of 66% we expected the LSVprice to be within the bounds of stochastic and local volatility price. We use Monte Carlo pricing with a fixed number of 32,000 paths (antithetic) and Quasi Random Numbers, a time discretization of 5 days and fixed 2048 particles. We denote the Black-Scholes prices as BS or TV (theoretical value) in the graphs and use LV as abbreviation for prices in local volatility and HES for the Heston model without local volatility component. The prices can be seen in Figure 12 for EUR/USD 6M and in Figure 13 for EUR/BRL 6M. We observe the expected behavior in all cases, the mixed local stochastic volatility prices are within the range of local and stochastic volatility prices and the mixing parameter can be used to adjust the behavior to observed exotics prices (e.g., one touches) in the market. Usually this mixing parameter is quite stable across longer periods, often weeks or even months.

CONCLUSION
We apply machine learning principles to improve the calibration process of the local stochastic volatility models. The suggested meta parameters and heuristics seem to apply to a wide variety of underlyings in FX, liquid pairs like EUR/USD as well as emerging markets as EUR/BRL. The computational efficiency is at about the same level as for the formerly suggested Kernel Regression based approach. The results given by the PURBF function with pruning, regularization, and local width determined by 5 nearest neighbor performed significantly better than the Kernel based approaches, hence we would suggest to consider this approach in the calibration process.
Further work will be dedicated to improve the computational speed and to establish better measures of the quality. In particular in situations where vol surfaces are almost arbitragable we will need the method to continue to provide numerically stable results.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

ACKNOWLEDGMENTS
I would like to thank the Quant team of Leonteq to provide such a stimulating atmosphere, providing an environment to engage on a wide variety of practical and research oriented aspects of Quantitative Finance. In particular thanks to Nadzeya Bedziuk and Dmitry Davydov to critically crosscheck the document, still all errors are due to me.