Bias-variance trade-off in portfolio optimization under Expected Shortfall with $\ell_2$ regularization

The optimization of a large random portfolio under the Expected Shortfall risk measure with an $\ell_2$ regularizer is carried out by analytical calculation. The regularizer reins in the large sample fluctuations and the concomitant divergent estimation error, and eliminates the phase transition where this error would otherwise blow up. In the data-dominated region, where the number $N$ of different assets in the portfolio is much less than the length $T$ of the available time series, the regularizer plays a negligible role even if its strength $\eta$ is large, while in the opposite limit, where the size of samples is comparable to, or even smaller than the number of assets, the optimum is almost entirely determined by the regularizer. We construct the contour map of estimation error on the $N/T$ vs. $\eta$ plane and find that for a given value of the estimation error the gain in $N/T$ due to the regularizer can reach a factor of about 4 for a sufficiently strong regularizer.


Introduction
The current international market risk regulation measures risk in terms of expected shortfall (ES) [1]. In order to decrease their capital requirements, financial institutions have to optimize the ES of their trading book.
Optimizing a large portfolio may be dicult under any risk measure, but becomes particularly hard in the case of Value at Risk (VaR) and ES that discard a large part of the data except those at or above a high quantile. Without some kind of regularization, this leads to a phase transition at a critical value r c of the ratio r = N/T where N is the dimension of the portfolio (the number of dierent assets or risk factors) and T is the sample size (the length of the available time series). This critical ratio depends on the confidence level α that determines the VaR threshold above which the losses are to be averaged to obtain the ES. Beyond r c it is impossible to carry out the optim ization, and upon approaching this critical value from below the estimation error increases without bound.
The estimation error problem of portfolio selection has been the subject of a large number of works, [2][3][4][5][6][7][8][9][10][11][12] are but a small selection from this vast literature. The critical behavior and the locus of the phase boundary separating the region where the optimization is feasible from the one where it is not has also been studied in a series of papers [13][14][15][16][17].
In the present note we discuss the eect of adding an 2 regularizer to the ES risk measure. As noted in [18] and [19], the optimization problem so obtained is equivalent to one of the variants of support vector regression (ν-SVR) [20], therefore its study is of interest also for machine learning, independently of the portfolio optimization context. Concerning its specific financial application, 2 regularization may have two dierent interpretations. First, 2 has the tendency to pull the solution towards the diagonal, where all the weights are the same. As such, 2 represents a diversification pressure [21,22] that may be useful in a situation where, e.g. the market is dominated by a small number of blue chips. Alternatively, the portfolio manager may wish to take J. Stat. Mech. (2019) 013402 into account the market impact of the future liquidation of the portfolio already at its construction. As shown in [17], market impact considerations lead one to regularized portfolio optimization, with 2 corresponding to linear impact.
In this paper we carry out the optimization of 2 -regularized ES analytically in the special case of i.i.d. Gaussian distributed returns, in the limit where both the dimension and the sample size are large, but their ratio r = N/T is fixed. The calculation will be performed by the method of replicas borrowed from the statistical physics of disordered systems [23]. The present work extends a previous paper [24] by incorporating the regularizer. By preventing the phase transition from taking place, the regularizer fundamentally alters the overall picture (in this respect, the role of the regularizer is analogous to that of an external field coupled to the order parameter in a phase transition). As the technical details of the replica calculation have been laid out in [13] and, in a somewhat dierent form, in [17], we do not repeat them here. Instead we just recall the setup of the problem and focus on the most important result: the relative estimation error (closely related to the out-of-sample estimator of ES) as a function of r = N/T and the strength of the regularizer η.
Our results exhibit a clear distinction between the region in the space of parameters where data dominate and the regularizer plays a minor role, from the one where the data are insucient and the regularizer stabilizes the estimate at the price of essentially suppressing the data. Thereby, our results provide a clean and explicit example of what statisticians call the bias-variance trade-o that lies at the heart of the regularization procedure. We find that the transition between the data-dominated regime and the bias-dominated one is rather sharp, and it is only around this transition that an actual trade-o takes place. Following the curves of fixed estimation error on the r − η plane we can see that r increases with increasing η by a factor of approximately 4. Beyond this point the contour lines turn back and we go over onto a branch of the contour line where the optimization is determined by the regularizer rather than the data. The plan of the rest of the paper is as follows. In section 2 we present the formalism, in section 3 display our results and in section 4 draw our conclusions.

The optimization of regularized ES
The simple portfolios we consider here are linear combinations of N risk factors, with returns x i and weights w i , i = 1, 2, ..., N : The weights will be normalized such that their sum is N , instead of the customary 1: The motivation for choosing this normalization is that we wish to have weights of order unity, rather than 1/N , in the limit N → ∞. Apart from the budget constraint, https://doi.org/10.1088/1742-5468/aaf108 J. Stat. Mech. (2019) 013402 the weights will not be subject to any other condition. In particular, they can take any real value, that is we are allowing unlimited short positions. We do not impose the usual constraint on the expected return on the portfolio either, so we are looking for the global minimum risk portfolio. This setup is motivated by simplicity, but we note that tracking a benchmark requires exactly this kind of optimization to be performed. The probability for the loss ({w i }, {x i }) = −X to be smaller than a threshold 0 is: where p({x i }) is the probability density of the returns, and θ(x) is the Heaviside function: θ(x) = 1 for x > 0, and zero otherwise. The VaR at confidence level α is then defined as: Expected Shortfall is the average loss beyond the VaR quantile: Portfolio optimization seeks to find the optimal weights that make the above ES minimal subject to the budget constraint (2). Instead of dealing directly with ES, Rockafellar and Uryasev [25] proposed to minimize the related function over the variable and the weights w i : where The probability distribution of the returns is not known, so one can only sample it, and replace the integral in (4) by time-averaging over the discrete observations. Rockafellar and Uryasev [25] showed that the optimization of the resulting loss function can be reduced to the following linear programming task: minimize the cost function under the constraints It is convenient to introduce the regularizer at this stage, by adding the 2 -norm to the loss function [17]: https://doi.org/10.1088/1742-5468/aaf108 where and u are auxiliary variables, and the coecient η sets the strength of the regularization.
As the constraint on the expected return has been omitted, we are seeking the global optimum of the portfolio here. If the returns x it are i.i.d. Gaussian variables and N , T → ∞ with r = N/T fixed, the method of replicas allows one to reduce the above optimization task in N + T + 1 variables to the optimization of a cost function depending on only six variables, the so-called order parameters [13,17]: where · z is an average over the standard normal variable z, and One can readily see that the stationarity conditions are: where the variable w * is that value of the weight that minimizes the 'potential' V in (13). (The prime means derivative with respect to the argument.) Three of the order parameters are easily eliminated and the integrals can be reduced to the error function and its integrals by repeated integration by parts, as in [24]. Finally, one ends up with three equations to be solved: These functions are closely related to each other: Φ(x) is the derivative of Ψ(x) and Ψ(x) is the derivative of W (x). As explained in [24], each of the three remaining order parameters in the above set of equations, q 0 , ∆, and has a direct financial meaning: ∆ is related to the in-sample estimator of ES (and also to the second derivative of the cost function F with respect to the Lagrange multiplyer λ associated with the budget constraint) and is the insample VaR of the portfolio optimized under the ES risk measure. Our present concern is the order parameter q 0 , which is a measure of the out-of-sample estimator of ES. As shown in [24], if ES out is the out-of-sample estimate of ES based on samples of size T , and ES (0) is its true value (that would obtain for N finite and T → ∞), then that is √ q 0 − 1 is the relative estimation error of the out-of-sample estimate.
The task now is to solve the stationary conditions and get the cost function by substituting the solutions back into equation (12). The in-sample value of Expected Shortfall is simply related to the cost function as: The fundamental cause of the divergence of estimation error in the original, nonregularized problem is that ES as a risk measure is not bounded from below. In finite samples it can happen that one of the assets, or a combination of assets, dominates the others (i.e. produces a larger return than the others in the given sample), thereby leading to an apparent arbitrage: one can achieve an arbitrarily large gain (an arbitrarily large negative ES) by going very long in the dominant asset and correspondingly short in the others [13,26]. It is evident that this apparent arbitrage is a mere statistical fluctuation, but along a special curve in the r − α plane this divergence occurs with probability one [13]. As a result, the estimation error will diverge along the phase boundary shown in figure 1. Note that in high-dimensional statistics where regularization is routinely applied the loss is always bounded both from above and below. The setting in the present paper is, therefore, very dierent from the customary setup, which explaines the unusual results.
The purpose of regularization is to penalize the large fluctuations of the weight vector, thereby eliminating this phase transition.
Since ES is a piecewise linear function of the weights, the quadratic regularizer will overcome excessive fluctuations, no matter how small the coecient η is. Deep inside the region of stability (shown by pale yellow in figure 1), a weak regularizer (small η) will modify the behavior of various quantities very little. In contrast, close to the phase boundary, and especially in the vicinity of the point α = 1, r = 0.5 , where the solution has an essential singularity, the eect of even a small η is very strong, and beyond the yellow region, where originally there was no solution, the regularizer will dominate the scene. In the region where the solution is stable even without the regularizer, r = N/T is small, which means we have an abundance of data. We call this region the datadominated region. In the presence of the regularizer we will find finite solutions also far beyond the phase boundary, but here the regularizer dominates the data, so we can call this domain the bias-dominated region.

Results
The solution of the stationarity conditions can be obtained with the help of a computer. In the following, we present the numerical solutions for the relative estimation error. The results will be displayed by constructing the contour map of this quantity, which will allow us to make a direct comparison between our present results and those in [24].
In figure 2 we recall the contour map of the relative estimation error of ES without regularization.
As can be seen, without regularization the constant q 0 curves are all inside the feasible region. For larger and larger values of q 0 the corresponding curves run closer and closer to the phase boundary, along which the estimation error diverges. Note that the phase boundary becomes flat, with all its derivatives vanishing, at the upper right corner of the feasible region: there is an essential singularity at the point α = 1, r = 0.5.
The estimation error problem is very clearly illustrated in this figure: the curves corresponding to an acceptably small relative error are the lowest ones among the q 0 contour lines, and the value of r = N/T corresponding to a confidence limit α in the J. Stat. Mech. (2019) 013402 vicinity of 1 (such as the regulatory value α = 0.975) are extremely small on these low lying curves. These small values of r require an unrealistically large sample size T if N is not small. For example, at the regulatory value of α = 0.975, to achieve an estimation error smaller than 5%, for a portfolio with N = 100 assets one would need a time series of more than 7200 data points [24].
Let us see how regularization reorganizes the set of constant q 0 curves. above it. Along the lower branch the value of the ratio r is small, which means we have very large samples with respect to the dimension: this is the data-dominated regime. We can also see that when the data dominate, the dependence on the regularizer is weak: the set of curves inside the yellow region is quite similar in the two figures, even though the regularizer has been increased 5-fold from figure 3 to 4. Following the curve corresponding to a given value of q 0 , say the black one, we see that at the beginning it is increasing with α, in the vicinity of α = 1 it starts to decline, then it sharply turns around and shoots up steeply, leaving the feasible region and increasing with decreasing α. Along this upper branch the ratio r is not small any more. We do not have large samples here, in fact, the situation is just the opposite: the dimension N becomes larger than the size T of the samples. Clearly, in this regime the regularizer dominates and the data play only a minor role: this is the bias-dominated regime. It is interesting to note the sudden turn over of the constant q 0 curves in the vicinity of α = 1. Such a sharp feature would be extremely hard to discover if we wanted to solve the original optimization problem by numerical simulations: the simulation would jump over to the upper branch before we could observe the sharp dip and the identification of the boundary between the data-dominated and the bias-dominated regimes would be hard. This is even more so for real life data which are inevitably noisy. An important point in regularization is the correct choice of the parameter η. When data come from real observations, and the size of the sample (or the number of samples) is limited by time and/or cost considerations, the standard procedure is cross validation [27], i.e. using a part of the data to infer the value of η and checking the correctness of this choice on the other part. In the present analytical approach we have the luxury of infinitely many samples to average over, so we can obtain the value of the coecient of regularization by demanding a given relative error (that is a given q 0 ) for a given confidence limit α and given aspect ratio r = N/T . Figure 5 displays the plot of the given estimation error curves on the r − η plane for the specific value of α = 0.975 demanded by the new market risk regulation [1], and relative errors of 1%, 5% and 10%, respectively. It shows the change-over between the data-dominated resp. bias-dominated regimes very clearly. For a given value of r the corresponding value  of η can be read o from the curves. If r is small (i.e. the sample is large with respect to the dimension) the curves with the prescribed values of relative error are almost horizontal. This means that when we have sucient data the value of the regularizer is more or less immaterial: within reasonable limits we can choose any coecient for the regularizer, it will not change the precision of the estimate, because in this situation the data will determine the optimum. Conversely, when the data are insucient (r is not small, or it is even beyond the feasible region), the value of η necessary to enforce a given relative error strongly depends on r. In this region, however, we need a smaller and smaller η to find the same relative error, because here the data almost do not matter and even a small regularizer will establish the optimum. The transition between these two regimes takes place around the points where the curves turn back. This happens still inside the feasible region, and the width of this range is rather small: from the r value corresponding to η = 0 to the one where the curves turn around the increase of r always remains within a factor of about 4. Let us take a closer look at that part of the previous figure where the curves turn around and r starts to increase. Figure 6(a) shows this region in higher resolution. For a given, small, value of the estimation error (such as 1% or 2%), r grows by a factor of about four by the time we reach the elbow of the curves (at rather large η values). This means that for a given sample size T the regularization allows us to consider a four times larger portfolio without increasing the estimation error. Conversely, for a given value of N the regularizer allows the use of four times shorter time series without compromising the quality of the estimate. Of course, the growth of r could be followed beyond the elbow, up to higher values along the constant q 0 curves, but it must be clear that these sections of the curves correspond to a situation where the estimate is mostly or entirely determined by the regularizer. This is also shown by the fact that the curves of given estimation error strongly lean backwards to the vertical axis: where the dimension is high and the data few even a weak regularizer can stabilize the estimate, but it will then have nothing to do with the information coming from the time series.  A gain of a factor four in the allowed region in r could be regarded as very satisfactory, were it not for the fact that the initial (η = 0) value of r along the small estimation error curves is so small that it remains small even after a multiplication by 4. η r 5% 1% 10% Figure 5. The overall behavior of the contour lines of fixed estimation error (fixed q 0 ) on the r = N/T − η plane, for a given value of the confidence limit α = 0.975 and for three dierent values (1%, 5% and 10%) of the relative estimation error. The data-dominated and bias-dominated regions correspond to the two branches of these curves: in the range of small r's the curves depend on the strength of the regularizer very weakly, while for r's in the vicinity of the phase boundary, and even more for large r's high in the originally unfeasible region, the fixed estimation error curves display a strong dependence on η.  If we inspect another curve, corresponding to a larger estimation error (say, 5%), we can see that it turns back for a much smaller η, but the relative increase of r up to the elbow is still about a factor 4. We can also see that beyond this point the curves very quickly reach the region where the regularizer dominates. Figure 6(b) displays the same curves as in figure 6(a), but this time they are normalized by their vales at η = 0, so that they show the gain in r due to the regularizer.

Conclusion
We have considered the problem of optimizing ES in the presence of an 2 regularizer for the case of uncorrelated Gaussian returns. The regularizer takes care of the large sample fluctuations and eliminates the phase transition that would be present in the problem without regularization. Deep inside the feasible region, where we have a large amount of data relative to the dimension, the size of the sample needed for a given level of relative estimation error is basically constant, largely independent of the regularizer. In the opposite case, for sample sizes comparable to or small relative to the dimension, the regularizer dominates the optimization and suppresses the data. The transition between the the data-dominated regime and the regularizer-dominated one is rather narrow. It is in this transition region where we can meaningfully speak about a trade-o between fluctuation and bias, otherwise one or the other dominates the estimation. The identification of this transitional zone is easy within the present scheme, where we could carry out the optimization analytically: the transitional zone is the small region where the curves in figure 5 sharply turn back, but still remain inside the originally feasible region. In real life, where the size of the samples can rarely be changed at will and where all kinds of external noise (other than that coming from the sample fluctuations) may be present, the distinction between the region where the data dominates and where the bias reigns may be much less clear, and one may not be sure where the transition takes place between them. Below this transition there is not much point in using regularization, because the data themselves are sucient to provide a stable and reliable estimate. Above the transition zone it is almost meaningless to talk about the observed data, because they are crowded out by the bias. The identification of the relatively narrow transition zone between these two extremes and the gain of a factor four below the transition are the main results of this paper.
It is important to stress that the results we presented in this paper have been obtained under the simplifying assumption that asset returns are i.i.d. Gaussian variables. This was done partly for consistency with previous work [13][14][15][16][17], but, more importantly, for mathematical convenience, as under this assumption it is relatively easy to provide an analytical characterization of the problem. With respect to real market data the assumption is unrealistic for two reasons: first, the distribution of returns is known to be dierent from a normal distribution, it displays fat tails, although with the general acceleration of trading the convergence to the Gaussian is becoming faster, especially on the time scales relevant for portfolio rebalancing. Second, the market is characterized by a non-trivial structure of correlations between assets, with modes representing the movement of the market as a whole and of the dierent market sectors.