Regularizing portfolio optimization

The optimization of large portfolios displays an inherent instability due to estimation error. This poses a fundamental problem, because solutions that are not stable under sample fluctuations may look optimal for a given sample, but are, in effect, very far from optimal with respect to the average risk. In this paper, we approach the problem from the point of view of statistical learning theory. The occurrence of the instability is intimately related to over-fitting, which can be avoided using known regularization methods. We show how regularized portfolio optimization with the expected shortfall as a risk measure is related to support vector regression. The budget constraint dictates a modification. We present the resulting optimization problem and discuss the solution. The L2 norm of the weight vector is used as a regularizer, which corresponds to a diversification ‘pressure’. This means that diversification, besides counteracting downward fluctuations in some assets by upward fluctuations in others, is also crucial because it improves the stability of the solution. The approach we provide here allows for the simultaneous treatment of optimization and diversification in one framework that enables the investor to trade off between the two, depending on the size of the available dataset.


1. Introduction
Markowitz' portfolio selection theory [1,2] is one of the pillars of theoretical finance. It has greatly influenced the thinking and practice in investment, capital allocation, index tracking, and a number of other fields. Its two major ingredients are (i) seeking a tradeoff between risk and reward, and (ii) exploiting the cancellation between fluctuations of (anti-)correlated assets. In the original formulation of the theory, the underlying process was assumed to be multivariate normal. Accordingly, reward was measured in terms of the expected return, risk in terms of the variance of the portfolio.
The fundamental problem of this scheme (shared by all the other variants that have been introduced since) is that the characteristics of the underlying process generating the distribution of asset prices are not known in practice, and therefore averages are replaced by sums over the available sample. This procedure is well justified as long as the sample size, T (i.e. the length of the available time series for each item), is sufficiently large compared to the size of the portfolio, N (i.e. the number of items). In that limit, sample averages asymptotically converge to the true average due to the central limit theorem.
Unfortunately, the nature of portfolio selection is not compatible with this limit. Institutional portfolios are large, with N's in the range of hundreds or thousands, while considerations of transaction costs and non-stationarity limit the number of available data points to a couple of hundred at most. Therefore, portfolio selection operates in a regime where N and T are, at best, of the same order of magnitude. This, however, is not the realm of classical statistical methods. Portfolio optimization is rather closer to a situation which, borrowing a term from statistical physics, might be termed the "thermodynamic limit", where N and T tend to infinity such that their ratio remains fixed.
It is evident that portfolio theory struggles with the same fundamental difficulty that underlies basically every complex modeling and optimization task: the high number of dimensions and the insufficient amount of information available about the system. This difficulty has been around in portfolio selection from the early days and a plethora of methods have been proposed to cope with it, e.g. single and multi-factor models [3], Bayesian estimators [4,5,6,7,8,9,10,11,12,13,14,15,16,17], or, more recently, tools borrowed from random matrix theory [18,19,20,21,22,23]. In the thermodynamic regime, estimation errors are large, sample-to-sample fluctuations are huge, and results obtained from one sample do not generalize well and can be quite misleading concerning the true process.
The same problem has received considerable attention in the area of machine learning. We discuss how the observed instabilities in portfolio optimization (elaborated in Section 2) can be understood and remedied by looking at portfolio theory from the point of view of machine learning.
Portfolio optimization is a special case of regression, and therefore can be understood as a machine learning problem (see Section 3). In machine learning, as well as in portfolio optimization, one wishes to minimize the actual risk, which is the risk (or error) evaluated by taking the ensemble average. This quantity, however, cannot be computed from the data; only the empirical risk can. The difference between the two is not necessarily small in the thermodynamic limit, so that a small empirical risk does not automatically guarantee small actual risk [24].
Statistical learning theory [24,25,26] finds upper bounds on the generalization error that hold with a certain accuracy. These error bounds quantify the expected generalization performance of a model, and they decrease with decreasing capacity of the function class that is being fitted to the data. Lowering the capacity therefore lowers the error bound and thereby improves generalization. The resulting procedure is often referred to as regularization and essentially prevents over-fitting (see Section 4).
In the thermodynamic limit, portfolio optimization needs to be regularized. We show in Section 5 how the above mentioned concepts, which find their practical application in support vector machines [27,28], can be used for portfolio optimization. Support vector machines constitute an extremely powerful class of learning algorithms which have met with considerable success. We show that regularized portfolio optimization, using the expected shortfall as a risk measure, is almost identical to support vector regression, apart from the budget constraint. We provide the modified optimization problem, which, because the regularizer is quadratic in the weights, can be solved by quadratic programming.
In Section 6, we discuss the financial meaning of the regularizer: minimizing the L2 norm of the weight vector corresponds to a diversification pressure. We also discuss alternative constraints that could serve as regularizers in the context of portfolio optimization.
Taking this machine learning angle allows one to organize a variety of ideas in the existing literature on portfolio optimization filtering methods into one systematic and well-developed framework. There are basically two choices to be made: (i) which risk measure to use, and (ii) which regularizer. These choices result in different methods, because different optimization problems are being solved.
While we focus here on the popular expected shortfall risk measure (in Section 5), the variance has a long history as an important risk measure in finance. Several existing filtering methods that use the variance risk measure essentially implement regularization, without necessarily stating so explicitly. The only work we found [7] that mentions regularization in the context of portfolio optimization has not been noticed by the ensuing, closely related literature. It is easy to show that when the L2 norm is used as a regularizer, the resulting method is closely related to Bayesian ridge regression, which uses a Gaussian prior on the weights (with the difference of the additional budget constraint). The work on covariance shrinkage, such as [8,9,10,11], falls into the same category. Other priors can be used [17], which can be expected to lead to different results (for an insightful comparison see e.g. [29]). Using the L1 norm has been popularized in statistics as the "LASSO" (least absolute shrinkage and selection operator) [29], and methods that use any Lp norm are also known as the "bridge" [30].

2. Preliminaries: Instability of classical portfolio optimization
Portfolio optimization in large institutions operates in what we called the thermodynamic limit, where both the number of assets and the number of data points are large, with their ratio a certain, typically not very small, number. The estimation problem for the mean is so serious [31,32] as to make the trade-off between risk and return largely illusory. Therefore, following a number of authors [8,9,33,34,35], we focus on the minimum variance portfolio and drop the usual constraint on the expected return. This is also in line with previous work (see [36] and references therein), and makes the treatment simpler without compromising the main conclusions. An extension of the results to the more general case is straightforward.
Nevertheless, even if we forget about the expected return constraint, the problem still remains that covariances have to be estimated from finite samples. It is an elementary fact from linear algebra that the rank of the empirical N × N covariance matrix is the smaller of N and T. Therefore, if T < N, the covariance matrix is singular and the portfolio selection task becomes meaningless. The point T = N thus separates two regions: for T > N the portfolio problem has a solution, whereas for T < N, it does not.
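The rank argument can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes zero-mean returns (so that the empirical covariance is the uncentred second-moment matrix), uses Gaussian toy data, and all function names are hypothetical choices made for illustration.

```python
import random


def sample_covariance(returns):
    """Uncentred empirical covariance C_ij = (1/T) * sum_k x_i^(k) x_j^(k)."""
    n_obs, n_assets = len(returns), len(returns[0])
    return [[sum(x[i] * x[j] for x in returns) / n_obs
             for j in range(n_assets)]
            for i in range(n_assets)]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def simulated_rank(n_assets, n_obs, seed=0):
    """Rank of the empirical covariance of i.i.d. standard normal toy returns."""
    rng = random.Random(seed)
    returns = [[rng.gauss(0.0, 1.0) for _ in range(n_assets)]
               for _ in range(n_obs)]
    return matrix_rank(sample_covariance(returns))


if __name__ == "__main__":
    print(simulated_rank(n_assets=5, n_obs=3))  # T < N: singular matrix
    print(simulated_rank(n_assets=5, n_obs=8))  # T > N: full rank
```

If the sample mean were subtracted before forming the covariance, one further degree of freedom would be lost; the sketch deliberately leaves this out to stay close to the statement above.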
Even if T is larger than N, but not much larger, the solution to the minimum variance problem is unstable under sample fluctuations, which means that it is not possible to find the optimal portfolio in this way. This instability of the estimated covariances, and hence of the optimal solutions, has been generally known in the community; however, the full depth of the problem has only been recognized recently, when it was pointed out that the average estimation error diverges at the critical point N = T [37,38,39].
In order to characterize the estimation error, Kondor and co-workers used the ratio $q_0^2$ between (i) the risk, evaluated at the optimal solution obtained by portfolio optimization using finite data and (ii) the true minimal risk. This quantity is a measure of generalization performance, with perfect performance when $q_0^2 = 1$, and increasingly bad performance as $q_0^2$ increases. As found numerically in [38] and demonstrated analytically by random matrix theory techniques in [40], the quantity $q_0$ is proportional to $(1 - N/T)^{-1/2}$ and diverges when T goes to N from above.
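The growth of $q_0^2$ as N/T increases can be illustrated by a small Monte Carlo experiment. The sketch below is only an illustration under simplifying assumptions that are ours, not the text's: returns are i.i.d. standard normal, the true covariance is the identity (so the true minimal risk is 1/N), and the mean is known to be zero. The function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def q0_squared(n_assets, n_obs, rng):
    """True risk of the estimated minimum variance portfolio over true minimum."""
    returns = [[rng.gauss(0.0, 1.0) for _ in range(n_assets)]
               for _ in range(n_obs)]
    cov = [[sum(x[i] * x[j] for x in returns) / n_obs
            for j in range(n_assets)]
           for i in range(n_assets)]
    v = solve(cov, [1.0] * n_assets)        # proportional to C^{-1} 1
    total = sum(v)
    w = [vi / total for vi in v]            # enforce the budget constraint
    # True covariance is the identity: risk of w is sum_i w_i^2, minimum is 1/N.
    return n_assets * sum(wi * wi for wi in w)


def mean_q0_squared(n_assets, n_obs, trials=200, seed=0):
    """Average of q0^2 over independently drawn samples."""
    rng = random.Random(seed)
    return sum(q0_squared(n_assets, n_obs, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n_obs in (200, 40, 20, 12):
        ratio = 10 / n_obs
        print(n_obs, mean_q0_squared(10, n_obs), 1.0 / (1.0 - ratio))
```

The last column printed is $1/(1 - N/T)$, the square of the scaling quoted above; for the small N used here one should expect agreement only up to finite-size corrections.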
The identification of the point N = T as a phase transition [36,41] allowed for the establishment of a link between portfolio optimization and the theory of phase transitions, which helped to organize a number of seemingly disparate phenomena into a single coherent picture with a rich conceptual content. For example, it has been shown that the divergence is not a special feature of the variance, but persists under all the other alternative risk measures that have been investigated so far: historical expected shortfall, maximal loss, mean absolute deviation, parametric VaR, expected shortfall, and semivariance [36,41,42,43]. The critical value of the N/T ratio, at which the divergence occurs, depends on the particular risk measure and on any parameter that the risk measure may depend on (such as the confidence level in expected shortfall).
However, as a manifestation of universality, the power law governing the divergence of the estimation error is independent of the risk measure [36,41,42], the covariance structure of the market [39], and the statistical nature of the underlying process [44]. Ultimately, this line of thought led to the discovery of the instability of coherent risk measures [45].

3. Statistical reasons for the observed instability in portfolio optimization
As mentioned above, for simplicity and clarity of the treatment we do not impose a constraint on the expected return, and only look for the global minimum risk portfolio. This task can be formalized as follows: given a fixed budget, customarily taken to be unity, given T past measurements of the returns of N assets, $x_i^{(k)}$, i = 1, . . ., N, k = 1, . . ., T, and given the risk functional $F(\mathbf{w}\cdot\mathbf{x})$, find a weighted sum (the portfolio), $\mathbf{w}\cdot\mathbf{x}$,‡ such that it minimizes the actual risk

$$R(\mathbf{w}) = \int F(\mathbf{w}\cdot\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \qquad (1)$$

under the constraint that $\sum_i w_i = 1$. The central problem is that one does not know the distribution $p(\mathbf{x})$, which is assumed to underlie the generation of the data. In practice, one then minimizes the empirical risk, replacing ensemble averages by sample averages:

$$R_{\mathrm{emp}}(\mathbf{w}) = \frac{1}{T}\sum_{k=1}^{T} F\big(\mathbf{w}\cdot\mathbf{x}^{(k)}\big) \qquad (2)$$

Now, let us interpret the weight vector as a linear model. The model class given by the linear functions has a capacity h, which is a concept that has been introduced by Vapnik and Chervonenkis in order to measure how powerful a learning machine is [24,25,26]. (In the statistical learning literature, a learning machine is thought of as having a function class at its disposal, together with an induction principle and an algorithmic procedure for the implementation thereof [46].) The capacity measures how powerful a function class is, and thereby also how easy it is to learn a model of that class. The rough idea is this: a learning machine has larger capacity if it can potentially fit more different types of data sets. Higher capacity comes, however, at the cost of potentially over-fitting the data. Capacity can be measured, for example, by the Vapnik-Chervonenkis (VC-) dimension [24], which is a combinatorial measure that counts how many data points can be separated in all possible ways by any function of a given class.
‡ Notation: bold face symbols are understood to denote vectors.

To make the idea tangible for linear models, focus on two dimensions (N = 2). For each number of points, n, one can choose the geometrical arrangement of the points in the plane freely. Once it is chosen, points are labeled by one of two labels, say "red" and "blue". Can a line separate the red points from the blue points for any of the $2^n$ different ways in which the points could be colored? The VC-dimension is the largest number of points for which this can be done. Two points can trivially be separated by a line. Three points that are not collinear can still be separated for any of the 8 possible labelings. However, for four points this is no longer the case, since for every geometrical arrangement one can find a labeling that cannot be separated by a line. The VC-dimension is thus 3, and in general, for linear models in N dimensions, it is N + 1 [46,47].
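The counting argument for N = 2 can be tried out on concrete point sets. The sketch below is illustrative only: it uses a perceptron as a separability test, which is our choice and not something the text prescribes, and a result of False merely means that no separating line was found within the iteration cap. The function names are hypothetical.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a line w.x + b = 0 separating the labels is found.

    A return value of False only means that no separator was found within
    max_epochs passes over the data.
    """
    w0 = w1 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            if label * (w0 * x + w1 * y + b) <= 0:
                # Misclassified (or on the line): standard perceptron update.
                w0 += label * x
                w1 += label * y
                b += label
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shattered(points):
    """True if every one of the 2^n labelings of the points is separable."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))


if __name__ == "__main__":
    print(shattered([(0, 0), (1, 0), (0, 1)]))          # three points, not collinear
    print(shattered([(0, 0), (1, 0), (2, 0)]))          # three collinear points
    print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # four points on a square
```

For the square, the labeling that assigns one color to each diagonal is the one that defeats every line, which is the familiar XOR configuration.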
In the regime in which the number of data points is much larger than the capacity of the learning machine, h/T ≪ 1, a small empirical risk guarantees small actual risk [24]. For linear functions through the origin that are otherwise unconstrained, the VC-dimension grows with N. In the thermodynamic regime, where N/T is not very small, minimizing the empirical risk does not necessarily guarantee a small actual risk [24]. Therefore it is not guaranteed to produce a solution that generalizes well to other data drawn from the same underlying distribution.
In solving the optimization problem that minimizes the empirical risk, Eq. (2), in the regime in which N/T is not very small, portfolio optimization over-fits the observed data. It thereby finds a solution that essentially pays attention to the seeming correlations in the data which come from estimation noise due to finite sample effects, rather than from real structure. The solution is thus different for different realizations of the data, and does not necessarily come close to the actual optimal portfolio.

4. Overcoming the instability
The generalization error can be bounded from above (with a certain probability) by the empirical error plus a confidence term that is monotonically increasing with some measure of the capacity, and depends on the probability with which the bound holds [48]. Several different bounds have been established, connected with different measures of capacity, see e.g. [47].
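To give a feeling for how such a confidence term behaves, the sketch below evaluates one commonly quoted form of a VC bound, in which the term is $\sqrt{[h(\ln(2T/h) + 1) - \ln(\eta/4)]/T}$ and the bound holds with probability $1 - \eta$. The choice of this particular form is an assumption made here for illustration; the text does not single out a specific bound, and the function name is hypothetical.

```python
import math


def vc_confidence(capacity, n_samples, eta=0.05):
    """Confidence term of an assumed, commonly quoted VC bound.

    capacity  : VC-dimension h of the function class
    n_samples : number of data points T
    eta       : the bound is taken to hold with probability 1 - eta
    """
    if not 0 < capacity < n_samples:
        raise ValueError("this form is only meaningful for 0 < h < T")
    numerator = (capacity * (math.log(2.0 * n_samples / capacity) + 1.0)
                 - math.log(eta / 4.0))
    return math.sqrt(numerator / n_samples)


if __name__ == "__main__":
    # Linear models in N dimensions have capacity N + 1.
    for n_assets in (10, 100, 500):
        print(n_assets, vc_confidence(n_assets + 1, n_samples=1000))
```

In this form the term grows with h at fixed T, so that for h/T of order one the bound says little about the actual risk, in line with the discussion of the thermodynamic regime.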
Poor generalization and over-fitting can be improved upon by decreasing the capacity of the model [25,26], which helps to lower the generalization error. Support vector machines are a powerful class of algorithms that implement this idea.
We suggest that if one wants to find a solution to the portfolio optimization problem in the thermodynamic regime, then one should not minimize the empirical risk alone, but also constrain the capacity of the portfolio optimizer (the linear model).
How can portfolio optimization be regularized? Portfolio optimization is essentially a regression problem, and therefore we can apply statistical learning theory, in particular the work on support vector regression.
Note first that the capacity of a linear model class for which the length of the weight vector is restricted to $\|\mathbf{w}\|^2 \le A$ has an upper bound which is smaller than the capacity of unconstrained linear models [25,26]. The capacity is minimized when the length of the weight vector is minimized [25,26]. Vapnik's concept of structural risk minimization [48] results in the support vector algorithm [27,28], which finds the model with the smallest capacity that is consistent with the data, that is, the model with the smallest $\|\mathbf{w}\|^2$. This leads to a convex constrained optimization problem [27,28] which, being quadratic in the weights, can be solved using quadratic programming.

5. Regularized portfolio optimization with the expected shortfall risk measure
While Markowitz' original formulation [1] measures risk by the variance, many other risk measures have been proposed since. Today, the most widely used risk measure, both in practice and in regulation, is Value at Risk (VaR) [49,50]. VaR has, however, been criticized for its lack of convexity, see e.g. [51,52,53], and an axiomatic approach, leading to the introduction of the class of coherent risk measures, was put forward [51]. Expected shortfall, essentially a conditional average measuring the average loss above a high threshold, has been demonstrated to belong to this class [54,55,56].
Expected shortfall has been steadily gaining popularity in recent years. The regularization we propose here is intended to cure its weak point, the sensitivity to sample fluctuations, at least for reasonable values of the ratio N/T.
Choose the risk functional $F(z) = z\,\theta(z - \alpha_\beta)$, where $\alpha_\beta$ is a threshold such that a given fraction β of the (empirical) loss distribution over z lies below $\alpha_\beta$. One now wishes to minimize the average over the remaining tail distribution, containing the fraction ν := 1 − β, which defines the expected shortfall, Eq. (3). The term in the sum of Eq. (3) implements the θ-function, while ν in the denominator ensures normalization of the tail distribution. It has been pointed out [57] that this optimization problem maps onto solving a linear program, Eqs. (4)-(6). We propose to implement regularization by including the minimization of $\|\mathbf{w}\|^2$. This can be done using a Lagrange multiplier, C, to control the trade-off: as we relax the constraint on the length of the weight vector, we can, of course, make the empirical error go to zero and retrieve the solution to the minimal expected shortfall problem.
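The tail average behind the expected shortfall can be illustrated on a list of losses. The sketch below assumes the familiar representation of the tail average as a minimum over a threshold ε of ε plus the normalized sum of the excess losses; its sign and normalization conventions are assumptions and may differ from those of the equations in this section. The function names are hypothetical.

```python
def tail_average(losses, nu):
    """Average of the worst fraction nu of the losses.

    Assumes that nu * T is (close to) an integer.
    """
    n_tail = max(1, round(nu * len(losses)))
    worst = sorted(losses, reverse=True)[:n_tail]
    return sum(worst) / n_tail


def shortfall_by_minimization(losses, nu):
    """min over eps of  eps + (1 / (nu * T)) * sum_k max(0, z_k - eps).

    The objective is convex and piecewise linear in eps, so it is enough
    to evaluate it at the sample points.
    """
    n_obs = len(losses)

    def objective(eps):
        excess = sum(max(0.0, z - eps) for z in losses)
        return eps + excess / (nu * n_obs)

    return min(objective(eps) for eps in losses)


def portfolio_losses(weights, returns):
    """Losses z_k = -w . x^(k) of a portfolio over the observed returns."""
    return [-sum(w * x for w, x in zip(weights, row)) for row in returns]


if __name__ == "__main__":
    losses = [float(z) for z in range(1, 11)]
    print(tail_average(losses, nu=0.2))
    print(shortfall_by_minimization(losses, nu=0.2))
```

The max(0, ·) inside the sum plays the role of the θ-function, and the factor 1/(νT) that of the normalization of the tail distribution.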
The new optimization problem is given by the objective function, Eq. (7), together with the constraints, Eqs. (8) to (10). The problem is mathematically almost identical to a support vector regression (SVR) algorithm called ν-SVR. There are two differences: (i) the budget constraint is added, and (ii) the loss function is asymmetric. Expected shortfall is an asymmetric version of the ε-insensitive loss used in support vector regression, defined as max{0, |f(x) − y| − ε}, where f(x) is the interpolant, and y the measured value (response).
In that sense ε measures an allowable error below which deviations are discarded.§ The use of asymmetric risk measures in finance is motivated by the consideration that investors are not afraid of upside fluctuations. However, to make the relationship to support vector regression as clear as possible, we will first solve the more general symmetrized problem, before restricting our treatment to the completely asymmetric case, corresponding to expected shortfall. In addition, one may argue that focusing exclusively on large negative fluctuations might not be advisable even from a financial point of view, especially when one does not have sufficiently large samples. In a relatively small sample it may happen that a particular item, or a certain combination of items, dominates the rest, i.e. produces a larger return than any other item in the portfolio at each time point, even though no such dominance exists on longer time scales. The probability of such an apparent arbitrage increases with the ratio N/T, and when it occurs it may encourage an investor acting on a lopsided risk measure to take up very large long positions in the dominating item(s), which may turn out to be detrimental in the long run. This is the essence of the argument that has led to the discovery of the instability of coherent and downside risk measures [43,45].
§ The mathematical similarity between minimum expected shortfall without regularization and the Eν-SVM algorithm [58] was pointed out, but incorrectly, in [59]. There is an important difference between the two optimization problems. In Eν-SVM, the length of the weight vector, $\|\mathbf{w}\|$, is constrained, which implements capacity control. In the pure expected shortfall minimization, Eq. (4), this is not done. Instead, the total budget $\sum_i w_i$ is fixed. This difference is not correctly identified in the proof of the central theorem (Theorem 1) in [59].

According to the above, let us consider the general case where positive deviations are also penalized. The objective function, Eq. (7), is then replaced by Eq. (11), a minimization over w, ξ and ε, and additional constraints, Eq. (12), have to be added to Eqs. (8) to (10). This problem corresponds to ν-SVR, a well understood regression method [60], with the only difference that the budget constraint, Eq. (10), is added here. In the finance context the associated loss might be called symmetric tail average (STA). Solving the regularized expected shortfall minimization problem, Eqs. (7)-(10), is a special case of solving the regularized STA minimization problem, Eq. (11) with the constraints Eqs. (8)-(10) and (12). Therefore, we solve the more general problem first (Section 5.1), before providing, in Section 5.2, the solution to the regularized expected shortfall, Eqs. (7)-(10).
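For orientation, the symmetrized problem can be written out in the standard ν-SVR form with a budget constraint added. The block below is a sketch under assumed conventions (the placement of C, the factor 1/T, and the signs of the inequalities are our assumptions) and is not meant to reproduce Eqs. (7)-(12) term by term.

```latex
% Sketch only: standard nu-SVR form plus a budget constraint (assumed conventions)
\begin{aligned}
\min_{\mathbf{w},\,\xi,\,\xi^{*},\,\epsilon}\quad
  & \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
    + C\Big(\nu\epsilon + \frac{1}{T}\sum_{k=1}^{T}\big(\xi_k + \xi_k^{*}\big)\Big) \\
\text{subject to}\quad
  & -\,\mathbf{w}\cdot\mathbf{x}^{(k)} \le \epsilon + \xi_k, \qquad \xi_k \ge 0, \\
  & \phantom{-}\,\mathbf{w}\cdot\mathbf{x}^{(k)} \le \epsilon + \xi_k^{*}, \qquad \xi_k^{*} \ge 0, \\
  & \epsilon \ge 0, \qquad \sum_{i=1}^{N} w_i = 1 .
\end{aligned}
```

Under these conventions, dropping the variables $\xi_k^{*}$ together with their constraints leaves the completely asymmetric problem, in which only losses are penalized.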

5.1. Regularized symmetric tail average minimization
The solution to the regularized symmetric tail average problem, Eq. (11) with the constraints Eqs. (8)-(10) and (12), is found in analogy to support vector regression, following [60], by writing down the Lagrangian, using Lagrange multipliers {α, α*, γ, λ, η, η*} for the constraints. The solution is then a saddle point, i.e. a minimum over the primal and a maximum over the dual variables. The Lagrangian is different from the one that arises in ν-SVR in that it is modified by the budget constraint; in it, 1 denotes the vector of length N whose components are all equal to one. Setting the derivative of the Lagrangian with respect to w to zero gives the optimal weight vector $\mathbf{w}_{\mathrm{opt}}$, Eq. (16). This solution for the optimal portfolio is sparse in the sense that, due to the Karush-Kuhn-Tucker conditions (see e.g. [61]), only those points contribute to the optimal portfolio weights for which the inequality constraints in Eq. (8), and the corresponding constraints in Eq. (12), are met exactly. The solution $\mathbf{w}_{\mathrm{opt}}$ contains only those points, and effectively ignores the rest. This sparsity contributes to the stability of the solution. Regularized portfolio optimization (RPO) operates, in contrast to general regression, with a fixed budget. As a consequence, the Lagrange multiplier γ now appears in the optimal solution, Eq. (16). Compared to the optimal solution in support vector (SV) regression, $\mathbf{w}_{\mathrm{SV}}$, the solution vector under the budget constraint, $\mathbf{w}_{\mathrm{RPO}}$, is shifted by γ. Let us now consider the dual problem.
The dual is, in general, a function of the dual variables, which are here {α, α*, γ, λ, η, η*}, although we will see in the following that some of these variables drop out. The dual is defined as $D := \min_{\mathbf{w},\xi,\xi^{*},\epsilon} L[\mathbf{w}, \xi, \xi^{*}, \epsilon, \alpha, \alpha^{*}, \gamma, \lambda, \eta, \eta^{*}]$, and the dual problem is then to maximize D over the dual variables. We can replace the minimization over w by evaluating the Lagrangian at $\mathbf{w}_{\mathrm{opt}}$, which requires evaluating the terms of the Lagrangian that contain w. For the other terms in the Lagrangian, we have to consider different cases. The case involving ε is justified as follows: if equality holds, the statement is trivially true, and if the inequality holds strictly, then L can be minimized by setting ε = 0.
Similar statements hold for the other constraints (the notation (*) means that a statement is true for variables with and without the asterisk). By a similar argument, the term γ in Eq. (14) disappears in the dual. Altogether, we have that either D = −∞, or D is given by a finite expression that holds together with conditions on the multipliers, among them conditions on α. We can analytically maximize over γ and obtain its optimal value. The optimal projection (= optimal portfolio), Eq. (27), then consists of two terms. For N → ∞ the second term vanishes and the solution is the same as the solution in support vector regression. Note that the kernel trick (see e.g. [47]), which is used in support vector machines to find nonlinear models, hinges on the fact that only dot products of input vectors appear in the support vector expansion of the solution. As a consequence of the budget constraint, one can no longer use the kernel trick (compare Eq. (27)). As long as we disregard derivatives, this is not a problem for portfolio optimization. Keep in mind, however, that the budget constraint introduces this otherwise undesirable property. Support vector algorithms typically solve the dual form of the problem (for a recent survey see [62]), which is in our case given by the maximization problem of Eq. (28). For N → ∞ the problem becomes identical to ν-SVR, which can be solved by quadratic programming, for which software packages are available [63]. For finite N, it can still be solved with existing methods, because it is quadratic in the $\alpha_k$'s. Solvers such as the ones discussed in [64] and [62] can be used, but have to be adapted to this specific problem.
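The shift by γ discussed above can be made concrete with a few lines of code. The sketch assumes that the support vector expansion has the form $\mathbf{w} = \sum_k c_k \mathbf{x}^{(k)} + \gamma \mathbf{1}$, where $c_k$ stands for the difference of the multipliers $\alpha_k$ and $\alpha_k^{*}$ (the sign convention is an assumption), and that γ is fixed by the budget constraint. The function names and the toy numbers are hypothetical.

```python
def budget_shift(coeffs, returns):
    """gamma = (1 - sum_k c_k * sum_i x_i^(k)) / N, from sum_i w_i = 1."""
    n_assets = len(returns[0])
    unconstrained_sum = sum(c * sum(x) for c, x in zip(coeffs, returns))
    return (1.0 - unconstrained_sum) / n_assets


def portfolio_from_expansion(coeffs, returns):
    """Weights w = sum_k c_k x^(k) + gamma * 1 and the shift gamma."""
    n_assets = len(returns[0])
    gamma = budget_shift(coeffs, returns)
    w_sv = [sum(c * x[i] for c, x in zip(coeffs, returns))
            for i in range(n_assets)]
    return [wi + gamma for wi in w_sv], gamma


if __name__ == "__main__":
    toy_returns = [[0.02, -0.01, 0.03],
                   [0.01, 0.04, -0.02]]
    toy_coeffs = [0.5, -0.25]   # hypothetical values of the expansion coefficients
    weights, gamma = portfolio_from_expansion(toy_coeffs, toy_returns)
    print(weights, gamma, sum(weights))
```

In this form the second contribution to the projection $\mathbf{w}\cdot\mathbf{x}$ is $\gamma\,\mathbf{1}\cdot\mathbf{x}$, which does not involve dot products between data points; this is the term that stands in the way of the kernel trick.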
The regularized symmetric tail average minimization problem (Eq. (11) with the constraints Eqs. (8)-(10) and (12)) is, as we have shown here, directly related to support vector regression, which uses the ε-insensitive loss function. The ε-insensitive loss is stable to local changes for data points that fall outside the range specified by ε. This point is elaborated in Section 3 in [60], and relates this method to robust estimation of the mean. It can also be extended to robust estimation of quantiles [60] by scaling of the slack variables $\xi_k$ by µ and $\xi_k^{*}$ by 1 − µ, respectively. This scaling translates directly to the portfolio optimization problem, which is an extreme case: downside risk measures penalize only loss, not gain. The asymmetry in the loss function corresponds to µ = 1.
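The loss functions mentioned here are easy to write down explicitly. In the sketch below the residual is taken to be the portfolio return, and we assume that the slack scaled by µ is the one attached to shortfalls (negative residuals); which side carries µ is an assumption on our part, as are the function names.

```python
def eps_insensitive(residual, eps):
    """Symmetric eps-insensitive loss: max(0, |residual| - eps)."""
    return max(0.0, abs(residual) - eps)


def asymmetric_eps_insensitive(residual, eps, mu):
    """Shortfalls beyond -eps are weighted by mu, excesses beyond +eps by 1 - mu."""
    shortfall = max(0.0, -residual - eps)
    excess = max(0.0, residual - eps)
    return mu * shortfall + (1.0 - mu) * excess


if __name__ == "__main__":
    for r in (-1.5, -0.3, 0.3, 1.5):
        print(r,
              eps_insensitive(r, eps=0.5),
              asymmetric_eps_insensitive(r, eps=0.5, mu=1.0))
```

With µ = 1/2 the asymmetric version is half the symmetric loss, while µ = 1 ignores gains altogether, which is the downside-only case described above.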

5.2. Regularized expected shortfall
By this final change we arrive at the regularized portfolio optimization problem, Eqs. (7)-(10), which we originally set out to solve. This is now easily solved in analogy to the previous paragraphs: the slack variables $\xi_k^{*}$ disappear, together with the respective Lagrange multipliers which enforce constraints, including $\alpha_k^{*}$. The optimal solution and the dual problem then follow from the expressions of the symmetric case with these variables removed. The dual problem is again a maximization which, like its symmetric counterpart, Eq. (28), can be solved by adjusting existing algorithms.
The formalism provides a free parameter, C, to set the balance between the original risk function and the regularizer. Its choice may depend on a number of factors, such as the investor's time horizon, the nature of the underlying data, and, crucially, on the ratio N/T. Intuitively, there must be a maximum allowable value $C_{\max}(N/T)$ for C, such that when one puts more emphasis on the data, $C > C_{\max}(N/T)$, then over-fitting will occur with high probability. It would be desirable to know an analytic expression for (a bound on) $C_{\max}(N/T)$. In practice, cross-validation methods are often employed in machine learning to set the value of C. Those methods are not free of problems (see, for example, the treatment in [65]), and the optimal choice of this parameter remains an open problem.
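The role of C can be explored with a crude numerical sketch. The code below is not the dual solver recommended in the text: it applies a projected subgradient method to an assumed primal objective, $\tfrac12\|\mathbf{w}\|^2 + C\,[\nu\epsilon + \tfrac1T\sum_k \max(0, -\mathbf{w}\cdot\mathbf{x}^{(k)} - \epsilon)]$ with $\epsilon \ge 0$ and $\sum_i w_i = 1$, returns the last iterate, and makes no claim about convergence speed. All names, step sizes and toy data are hypothetical.

```python
import random


def project_to_budget(weights):
    """Euclidean projection onto the hyperplane sum_i w_i = 1."""
    shift = (1.0 - sum(weights)) / len(weights)
    return [w + shift for w in weights]


def regularized_es_portfolio(returns, C, nu, steps=2000, lr=0.01):
    """Projected subgradient sketch for the assumed regularized objective."""
    n_obs, n_assets = len(returns), len(returns[0])
    w, eps = [1.0 / n_assets] * n_assets, 0.0
    for t in range(1, steps + 1):
        step = lr / t ** 0.5
        grad_w = list(w)            # gradient of 0.5 * ||w||^2
        grad_eps = C * nu
        for x in returns:
            loss = -sum(wi * xi for wi, xi in zip(w, x))
            if loss > eps:          # this observation lies in the penalized tail
                for i in range(n_assets):
                    grad_w[i] -= C * x[i] / n_obs
                grad_eps -= C / n_obs
        w = project_to_budget([wi - step * gi for wi, gi in zip(w, grad_w)])
        eps = max(0.0, eps - step * grad_eps)
    return w, eps


if __name__ == "__main__":
    rng = random.Random(1)
    toy = [[rng.gauss(0.0, 0.02) for _ in range(5)] for _ in range(50)]
    for C in (0.0, 1.0, 100.0):
        weights, eps = regularized_es_portfolio(toy, C=C, nu=0.2)
        print(C, [round(wi, 3) for wi in weights])
```

For C = 0 only the regularizer and the budget constraint remain, and the result is the equally weighted portfolio; increasing C lets the sample have more influence on the weights, which is the trade-off described above.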

6. Regularization corresponds to portfolio diversification
Above, we have controlled the capacity of the linear model by minimizing the L2 norm of the portfolio weight vector. In the finance context, minimizing $\|\mathbf{w}\|^2$ corresponds roughly to maximizing the effective number of assets, $N_{\mathrm{eff}}$, i.e. to exerting a pressure towards portfolio diversification [66]. We conclude that diversification of the portfolio is crucial, because it serves to counteract the observed instability by acting as a regularizer.
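The link between the L2 norm and diversification can be seen in a small example. The sketch assumes the common definition of the effective number of assets as the inverse of the sum of squared weights, $N_{\mathrm{eff}} = 1/\sum_i w_i^2$, for weights that sum to one; the text does not spell out its definition, so this is an assumption, and the function name is hypothetical.

```python
def effective_number_of_assets(weights):
    """Assumed definition: N_eff = 1 / sum_i w_i^2 for weights summing to one."""
    return 1.0 / sum(w * w for w in weights)


if __name__ == "__main__":
    print(effective_number_of_assets([0.25, 0.25, 0.25, 0.25]))  # fully diversified
    print(effective_number_of_assets([0.7, 0.1, 0.1, 0.1]))      # concentrated
    print(effective_number_of_assets([1.0, 0.0, 0.0, 0.0]))      # single asset
```

Under this definition, a smaller squared norm of the weight vector is the same thing as a larger effective number of assets.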
Other constraints that penalize the length of the weight vector could alternatively be considered as regularizers, in particular any Lp norm. The budget constraint alone, however, does not suffice as a regularizer, since it does not constrain the length of the weight vector. Adding a ban on short selling, $w_i \ge 0$, to the budget constraint, $\sum_i w_i = 1$, limits the allowable solutions to a finite volume in the space of weights and is equivalent to requiring that $\sum_i |w_i| \le 1$. It thereby imposes a limit on the L1 norm, that is, on the sum of the absolute amplitudes of long and short positions.
One may argue that it may be a good idea to use the L1 norm instead of the L2 norm, because that may make the solution sparser.However, the L1 norm has a tendency to make some of the weights vanish.Indeed, it has been shown that in the orthonormal design case (using the variance as the risk measure) an L1 regularizer will set some of the weights to zero, while an L2 regularizer will scale all the weights [29].The spontaneous reduction of portfolio size has also been demonstrated in numerical simulations [67]: as one goes deeper and deeper into the regime where T is significantly smaller than N, under a ban on short selling, more and more of the weights will become zero.The same "freezing out" of the weights has been observed in portfolio optimization [68] as an empirical fact.
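The contrast between the two regularizers in the orthonormal design case can be illustrated with the textbook closed forms: soft thresholding for an L1 penalty and uniform rescaling for an L2 penalty. The sketch assumes these standard forms and ignores the budget constraint, so it only illustrates the qualitative difference; the function names and numbers are hypothetical.

```python
import math


def l1_shrink(weights, lam):
    """Soft thresholding: sign(w) * max(|w| - lam, 0) for each component."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in weights]


def l2_shrink(weights, lam):
    """Ridge-type rescaling: w / (1 + lam) for each component."""
    return [w / (1.0 + lam) for w in weights]


if __name__ == "__main__":
    unregularized = [3.0, -0.5, 1.5]
    print(l1_shrink(unregularized, lam=1.0))  # small components are set to zero
    print(l2_shrink(unregularized, lam=1.0))  # all components are scaled
```

Which components fall below the threshold depends on the unregularized estimates, and therefore on the sample at hand.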
It is important to stress that the vanishing of some of the weights does not reflect any structural property of the objective function; it is just a random effect: as clearly demonstrated by simulations [67], for a different sample a different set of weights vanishes. The angle of the weight vector fluctuates wildly from sample to sample. (The behavior of the solutions is similar for other limit systems as well.) This means that the solutions will be determined by the limit system and the random sample, rather than by the structure of the market. So the underlying instability is merely "masked", in that the solutions do not run away to infinity, but they are still unstable under sample fluctuations when T is too small. As it is certainly not in the interest of the investor to obtain a portfolio solution which sets weights to zero on the basis of unreliable information from small samples, the above observations speak strongly in favor of using the L2 norm over the L1 norm.

7. Conclusion
We have made the observation that the optimization of large portfolios minimizes the empirical risk in a regime where the data set size is similar to the size of the portfolio. In that regime, a small empirical risk does not necessarily guarantee a small actual risk [24]. In this sense naive portfolio optimization over-fits the data. Regularization can overcome this problem by reducing the capacity of the considered model class.
Regularized portfolio optimization has choices to make, not only about the risk function, but also about the regularizer. Here, we have focused on the increasingly popular expected shortfall risk measure. Using the L2 norm as a regularizer leads to a convex optimization problem which can be solved with quadratic programming. We have shown that regularized portfolio optimization is then a variant of support vector regression. The differences are an asymmetry, due to the tolerance to large positive deviations, and the budget constraint, which is not present in regression.

This point has been made independently by [17].
Our treatment provides a novel insight into why diversification is so important. The L2 regularizer implements a pressure towards portfolio diversification. Therefore, from a statistical point of view, diversification is important as it is one way to control the capacity of the portfolio optimizer and thereby to find a solution which is more stable, and hence meaningful.
In summary, the method we have outlined in this paper allows for the unified treatment of optimization and diversification in one principled formalism. It shows how known methods from modern statistics can be used to improve the practice of portfolio optimization.
$\xi_k^{(*)} = 0$. Reason: if the inequality holds strictly, then L can be minimized by setting $\xi_k^{(*)} = 0$. If equality holds, then it is trivially true.