Seven Proofs for the Subadditivity of Expected Shortfall

Subadditivity is the key property that distinguishes the popular risk measures Value-at-Risk and Expected Shortfall (ES). In this paper we offer seven proofs of the subadditivity of ES, some found in the literature and some not. One of the main objectives of this paper is to provide a general guideline for instructors to teach the subadditivity of ES in a course. We discuss the merits and suggest appropriate contexts for each proof. With different proofs, different important properties of ES are revealed, such as its dual representation, optimization properties, continuity, consistency with convex order, and natural estimators.


Introduction
In a course on Quantitative Risk Management, an instructor inevitably has to discuss Value-at-Risk (VaR) and Expected Shortfall (ES) as the two standard risk measures to determine capital requirements for a financial institution. The reader is referred to [17] for recent extensive debates on "VaR versus ES in banking regulation" as well as [29] for broader background material.
Following [3], the key property that distinguishes any coherent risk measure, such as ES, from VaR is subadditivity. More precisely, a subadditive risk measure ρ satisfies ρ(X + Y) ≤ ρ(X) + ρ(Y) for any two risks X and Y. This property is closely related to questions on portfolio diversification and risk aggregation; more detailed theory and financial interpretation of subadditivity can be found in [10]¹. Note that here, and throughout the paper, losses are accounted for as positive values.
Despite its relevance, it is somewhat surprising that many academics and risk professionals do not know explicitly how to prove that ES is subadditive, although they are all aware of the validity of the statement. In most of the mainstream textbooks used in actuarial science, quantitative finance, or quantitative risk management, a proof of this property is either (i) split into several disconnected parts, (ii) reliant on advanced results in modern probability or statistics, (iii) too mathematically involved for a typically broad class of students attracted to a course in the above fields, or (iv) even skipped.
Indeed, we shall see that the subadditivity of ES is not a trivial property; it relates to the dependence structure between random variables. Some mathematical proofs found in the literature can be quite involved. In view of the growing importance of ES for regulation (see recent regulatory documents [4][5][6] and [24]), it is clear to the authors that concise proofs of this property should be clearly conveyed to academics and practitioners in the quantitative fields of finance and risk management. Moreover, different proofs reveal different properties of ES, each with their own specific relevance for practice.
We first introduce some basic notation. Let (Ω, F, P) be an atomless probability space². Throughout, all random variables are defined on (Ω, F, P) and all probability measures are defined on (Ω, F). Let L^0 be the set of all random variables, L^1 the set of all integrable random variables and L^∞ the set of all (essentially) bounded random variables; for a definition of essential supremum, see for instance Billingsley [7, p. 241].
The main objective is to provide a variety of proofs of the following theorem.
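In the notation of this paper, the theorem in question is the subadditivity inequality referred to as (1.1) throughout; the following is a rendering consistent with how (1.1) is invoked in the proofs below.

```latex
\textbf{Theorem 1.1.} \emph{Let $p \in (0,1)$. For all $X, Y \in L^1$,}
\begin{equation}
  \mathrm{ES}_p(X + Y) \le \mathrm{ES}_p(X) + \mathrm{ES}_p(Y).
  \tag{1.1}
\end{equation}
```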
In Section 2 we discuss some general issues and common basic lemmas related to the proofs of Theorem 1.1, and in Section 3 we present seven proofs based on different techniques. Each proof is self-contained, and when necessary, we refer to classic results in a respective field of study. Although most of the intermediate results are known in the literature, we give an elementary proof wherever possible so that our proofs can be directly used by an instructor. With different proofs, we reveal different important properties of ES such as its dual representation, optimization properties, continuity, consistency with convex order, and natural estimators. We comment on merits of the proofs, and suggest appropriate contexts within which to use them.
The class of spectral risk measures in [2] can be written as a continuously weighted average of ES; see [37] and [26]. Therefore, by showing the subadditivity of ES, one directly obtains the subadditivity of any spectral risk measure.

Remark 1.3. The question "who was the first to show that ES is subadditive?" has no definite answer, since the introduction of ES came long after the theory of Choquet integrals (including ES as a special case) was established. An implicit result close to the subadditivity of ES in a discrete probability space is Proposition 10.2.5 of [22], based on a lemma dating back to [8]. The subadditivity theorem of Choquet integrals in a general probability space is given in Chapter 6 of [11].
Throughout, denote (x)+ = max{x, 0} for x ∈ R, and let N be the set of positive integers. For a random variable X, we denote by F_X the distribution function, or simply the distribution, of X (under P). Note that for p ∈ (0, 1), F_X^{-1}(p) = inf{x ∈ R : F_X(x) ≥ p} = VaRp(X); both the notation VaRp(X) and the notation F_X^{-1}(p) will be used whenever convenient; for detailed properties of the latter, see [16]. All expectations ("E") are considered under P unless a superscript indicating another probability measure ("E^Q") is present.
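For reference, the definitions consistent with the notation above (with ES the tail average of VaR, as used in Lemma 2.2 and throughout) can be written as:

```latex
\mathrm{VaR}_p(X) = F_X^{-1}(p) = \inf\{x \in \mathbb{R} : F_X(x) \ge p\},
\qquad
\mathrm{ES}_p(X) = \frac{1}{1-p} \int_p^1 \mathrm{VaR}_q(X) \,\mathrm{d}q,
\qquad p \in (0,1).
```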

General discussion: basic properties and lemmas
In this section, we list some basic properties and short lemmas on random variables and ES, which will be used across different proofs in Section 3. All the properties in this section naturally appear in a quantitative course which covers VaR and ES, and hence they create no extra burden in the teaching of such a course.
For a risk measure ρ, it is straightforward to check that both the risk measures VaRp and ESp are law-determined (they depend on a risk only through its distribution), monotone and translation-invariant. These properties will be frequently used throughout the paper. The lemma below yields the foundation of many results in probability theory, such as Sklar's theorem in the study of copulas.
Proof. This is a classic result; see Rüschendorf [33, Proposition 1.3], where the construction is referred to as the distributional transform. We give the construction of U_X below. If F_X is continuous, taking U_X = F_X(X) would suffice. Generally, let V be a U[0, 1] random variable independent of X, and write

U_X = F_X(X−) + V (F_X(X) − F_X(X−)),

where F_X(x−) denotes the left-limit of the function F_X at x ∈ R. The interested reader can check that U_X is U[0, 1]-distributed and X = F_X^{-1}(U_X) almost surely.
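The construction above is easy to try out numerically; the following is a hypothetical sketch (the two-point distribution is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X takes the values 0 and 1, each with probability 1/2 (so F_X has atoms).
x = rng.integers(0, 2, size=n).astype(float)

def F(t):        # distribution function F_X(t)
    return np.where(t < 0, 0.0, np.where(t < 1, 0.5, 1.0))

def F_left(t):   # left limit F_X(t-)
    return np.where(t <= 0, 0.0, np.where(t <= 1, 0.5, 1.0))

# Distributional transform: U_X = F_X(X-) + V (F_X(X) - F_X(X-))
v = rng.uniform(size=n)              # V ~ U[0, 1], independent of X
u = F_left(x) + v * (F(x) - F_left(x))

def quantile(p):  # F_X^{-1}(p) = inf{t : F_X(t) >= p}
    return 0.0 if p <= 0.5 else 1.0

recovered = np.array([quantile(q) for q in u])
print(np.allclose(recovered, x))     # X = F_X^{-1}(U_X) almost surely
print(abs(u.mean() - 0.5) < 0.01)    # U_X is (approximately) U[0, 1]
```

Even though F_X has atoms, the extra uniform V "spreads out" the mass at each atom, so U_X is uniform and X is recovered from it through the quantile function.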
Throughout the rest of the paper, for any random variable X, let U_X be a U[0, 1] random variable such that X = F_X^{-1}(U_X). The two lemmas below give basic ES formulas in terms of VaR. Note that Lemma 2.3 is based on Lemma 2.2, which is further based on Lemma 2.1. For each proof in Section 3, we will indicate whether any of the Lemmas 2.1-2.3 is required.

Lemma 2.3. For any X ∈ L^1 and p ∈ (0, 1),

ESp(X) = 1/(1 − p) E[X I{X > VaRp(X)}] + 1/(1 − p) VaRp(X) (P(X ≤ VaRp(X)) − p).

Proof. This follows from Lemma 2.2, where the convergence involved is justified by the Monotone Convergence Theorem.

We also record a reduction: for X, Y ∈ L^1, let X̄ = max{X, m} and Ȳ = max{Y, m} with m ≤ min{VaRp(X), VaRp(Y)}. Then ESp(X̄) = ESp(X), ESp(Ȳ) = ESp(Y) and X + Y ≤ X̄ + Ȳ. Hence, by monotonicity of ESp, once (1.1) holds for all X, Y ∈ L^1 bounded from below, it holds on all of L^1.
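As a sanity check of the tail-expectation formula in Lemma 2.3, one can compare it numerically with the average of VaR over (p, 1) on a small discrete example (hypothetical code; the three-point distribution, chosen to have an atom at its p-quantile, is an assumption made for illustration):

```python
import numpy as np

p = 0.9
# A discrete loss with an atom at its p-quantile:
# P(X=0)=0.50, P(X=1)=0.45, P(X=2)=0.05, so VaR_0.9(X) = 1.
values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.50, 0.45, 0.05])
cdf = np.cumsum(probs)

def var(q):
    # F_X^{-1}(q) = inf{x : F_X(x) >= q}
    return values[np.searchsorted(cdf, q)]

# ES_p as the average of VaR_q over q in (p, 1), by midpoint integration
qs = np.linspace(p, 1, 100_000, endpoint=False) + (1 - p) / 200_000
es_integral = var(qs).mean()

# Lemma 2.3: tail expectation plus an atom correction at VaR_p
v = var(p)
tail = np.sum(values * probs * (values > v))     # E[X I{X > VaR_p(X)}]
atom = v * (np.sum(probs * (values <= v)) - p)   # VaR_p(X) (P(X <= VaR_p(X)) - p)
es_formula = (tail + atom) / (1 - p)

print(es_integral, es_formula)  # both equal 1.5 (up to discretization error)
```

The atom correction term is exactly what makes the formula valid when P(X = VaRp(X)) > 0; for continuous distributions it vanishes.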

Seven proofs of the subadditivity of ES
In the following, seven proofs are ordered by their (perceived or real) level of technical difficulty. Each proof is self-contained and the reader does not need to follow the given order. The proofs may require some of the Lemmas 2.1-2.3. Proofs 5-7 further require some classic results from probability or statistics. To the best of our knowledge, Proofs 1 and 4 are not clearly given in the literature, whereas the others can be found, albeit in slightly different forms.

A proof based on comonotonicity
This proof requires Lemma 2.1. In the following, denote by Bp the set of Bernoulli(1 − p) random variables and write A_X = I{U_X ≥ p} ∈ Bp.
From Lemma 3.1,

ESp(X + Y) = 1/(1 − p) E[(X + Y) A_{X+Y}] = 1/(1 − p) E[X A_{X+Y}] + 1/(1 − p) E[Y A_{X+Y}] ≤ ESp(X) + ESp(Y),

since A_{X+Y} ∈ Bp. That is, ESp is the supremum of the additive maps X → 1/(1 − p) E[XB] over B ∈ Bp, and hence is subadditive. Historically, this result dates back to [21] and [19], and it is now common knowledge in quantitative risk management at the graduate level; if it is taken for granted, then Lemma 3.1 can be omitted and the proof can be further shortened. (3.1) gives a special form of the coherence representation of ESp along the lines of [3]; a similar formulation is given in, for instance, Denuit et al. [12, Remark 2.4.8]. This representation links ESp to stress testing. Indeed, note that in practice p is close to 1, so that 1 − p is typically very small. In the case of Economic Capital, p = 0.9997,
whereas for regulatory purposes p typically lies in the range of 0.95 to 0.999.
The merits of Proof 1 include: it is by far the shortest proof; it reveals a coherence representation of ESp in (3.1); and it connects naturally to comonotonicity and the theory of copulas. We would recommend Proof 1 in a course where the concepts of copulas, comonotonicity, or the coherence representation of ES are points of interest.
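The mechanism behind Proof 1 can be seen in a small Monte Carlo experiment (hypothetical code; the joint model for X and Y is an arbitrary choice): the indicator of the worst (1 − p)-fraction of scenarios for X + Y is one feasible B for both X and Y, which forces subadditivity.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.95, 200_000

# A hypothetical dependent pair of losses
x = rng.lognormal(0.0, 1.0, n)
y = 0.5 * x + rng.exponential(1.0, n)

def es(sample, p):
    # Empirical ES_p: average of the largest ceil(n(1-p)) observations
    s = np.sort(sample)
    m = int(np.ceil(len(s) * (1 - p)))
    return s[-m:].mean()

z = x + y
a_z = z >= np.quantile(z, p)   # worst (1-p)-fraction of scenarios for X + Y
# E[X A_Z]/(1-p) cannot exceed ES_p(X), because A_Z is just one
# Bernoulli(1-p) indicator while ES_p(X) corresponds to the best one, A_X;
# the same holds for Y, so the average of z on those scenarios is bounded.
lhs = z[a_z].mean()            # approximately ES_p(X + Y)
rhs = es(x, p) + es(y, p)
print(lhs <= rhs)
```

The same stress-testing reading applies in practice: ES of a portfolio is the average loss on the worst (1 − p)-fraction of scenarios, and those scenarios are only suboptimal stresses for each sub-portfolio.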

A proof based on an optimization property of VaR and ES
This proof requires Lemma 2.2. It is based on a few extra lemmas but it requires no additional knowledge of modern probability theory.
Note that for t ∈ R, E[(X − t)+] = ∫_t^∞ (1 − F_X(x)) dx, where the last equality is due to integration by parts. Write t₀ = VaRp(X) and f(t) = t + 1/(1 − p) E[(X − t)+]. For t > t₀, we have 1 − F_X(x) ≤ 1 − p for x ≥ t₀. As a consequence,

f(t) − f(t₀) = (t − t₀) − 1/(1 − p) ∫_{t₀}^{t} (1 − F_X(x)) dx ≥ (t − t₀) − (t − t₀) = 0.

For t < t₀, from the definition of VaRp(X), we have that for x ∈ (t, t₀), F_X(x) < p and hence 1 − F_X(x) > 1 − p. As a consequence,

f(t₀) − f(t) = (t₀ − t) − 1/(1 − p) ∫_{t}^{t₀} (1 − F_X(x)) dx ≤ (t₀ − t) − (t₀ − t) = 0.

In summary, t₀ ∈ argmin_{t ∈ R} f(t), that is, (3.2) holds.
Proof. This follows directly from Lemmas 2.2 and 3.2.
Note that (x + y)+ ≤ (x)+ + (y)+ for all x, y ∈ R. Therefore, writing t₁ = VaRp(X), t₂ = VaRp(Y) and t = t₁ + t₂,

ESp(X + Y) ≤ t + 1/(1 − p) E[(X + Y − t)+] ≤ t₁ + 1/(1 − p) E[(X − t₁)+] + t₂ + 1/(1 − p) E[(Y − t₂)+] = ESp(X) + ESp(Y),

where the last equality follows from Lemma 3.3. The optimization properties in Lemmas 3.2 and 3.3 appear in [2] and [32]. Based on these optimization properties, a clear interpretation of VaRp being a cost-efficient threshold and ESp being the corresponding minimal cost is given in [12] and [25]. A straightforward geometric proof of Lemma 3.3 can be found in Dhaene et al. [15, Theorem 1]. The main idea in Proof 2 is also used to show the subadditivity of the Haezendonck-Goovaerts risk measure; see [20].
The merits of Proof 2 include: it is based on real analysis, without involving techniques from modern probability theory; it reveals the important optimization properties of VaRp and ESp in Lemmas 3.2 and 3.3; and it is easy to understand for undergraduate students. We would recommend Proof 2 in a course where the target audience is at the undergraduate level, or where the optimization properties in Lemmas 3.2 and 3.3 are points of interest.
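The optimization property in Lemmas 3.2 and 3.3 is easy to verify numerically (hypothetical code; the standard normal loss and the grid are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.9, 500_000
x = rng.standard_normal(n)

def f(t):
    # The objective of Lemmas 3.2-3.3: f(t) = t + E[(X - t)_+] / (1 - p)
    return t + np.maximum(x - t, 0.0).mean() / (1 - p)

ts = np.linspace(-1.0, 3.0, 401)
vals = np.array([f(t) for t in ts])
t_star = ts[vals.argmin()]

var_p = np.quantile(x, p)        # empirical VaR_p
m = int(np.ceil(n * (1 - p)))
es_p = np.sort(x)[-m:].mean()    # empirical ES_p
print(t_star, var_p)             # minimizer is close to VaR_p(X)
print(vals.min(), es_p)          # minimum is close to ES_p(X)
```

This is the "cost-efficient threshold" picture: the grid minimizer sits at the empirical p-quantile, and the minimal value of the objective is the empirical ES.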

A proof based on generalized indicator functions
This proof requires Lemma 2.3. We first need some basic results about generalized indicator functions.
In the following, for p ∈ (0, 1), X ∈ L^∞ and x ∈ R, we define a generalized indicator function.

Lemma 3.5. For p ∈ (0, 1), X ∈ L^∞ and x ∈ R, the following hold:

Proof. (i) It suffices to verify that 0 ≤ P(X ≤ VaRp(X)) − p ≤ P(X = VaRp(X)), which directly follows from the quantile inequalities P(X < VaRp(X)) ≤ p ≤ P(X ≤ VaRp(X)). On the other hand, 1 − P(X ≤ VaRp(X)) = 1 − P(X < VaRp(X)) ≥ 1 − p. (iii) If P(X = VaRp(X)) = 0, then the claim follows by Lemma 2.3, noting that P(X ≤ VaRp(X)) = p. If P(X = VaRp(X)) > 0, the claim follows from a similar computation.

The merit of Proof 3 is that it is based on standard real analysis, without involving techniques from more advanced probability theory, and hence it is accessible to undergraduate students. An important aspect, also relevant for practice, is the special treatment of the case when the loss random variable X has an atom at VaRp(X). In addition, the proof is fairly simple if only the case of continuous distributions is of interest.
We would recommend Proof 3 in a course where the target audience is at the undergraduate level, or where the instructor intends to teach only the case of continuous distributions rather than a complete proof. Proof 3 shares a similar argument with Proof 1. Though it can be viewed as elementary, it is technically more involved than Proof 1.

A proof based on discrete approximation
This proof requires Lemma 2.1. In the following, for n ∈ N, we say that a random variable X is n-discrete if it takes values in a set of at most n points, each with probability 1/n or a multiple of 1/n. We say that a random vector (X, Y) is n-discrete if it takes values in a set of at most n vectors, each with probability 1/n or a multiple of 1/n. Note that (X, Y) being n-discrete implies that X and Y are n-discrete, but not vice versa.
This proof contains two steps: we first show that Theorem 1.1 holds for an n-discrete random vector, and then approximate a general random vector by n-discrete random vectors. The second step involves convergence of random variables and is more technical than the first step.

Proof. We consider three cases: (i) p is a multiple of 1/n; (ii) p is rational; and (iii) p is general.
(i) Suppose that np ∈ N. Since (X, Y) is n-discrete, we can divide the sample space Ω into a partition Ω₁, . . . , Ωₙ, each with probability 1/n, such that for i = 1, . . . , n, (X, Y) takes a fixed value, denoted by (x_i, y_i), on Ω_i. Note that for k = 1, . . . , n and q ∈ ((k − 1)/n, k/n],

VaRq(X) = x_[n−k+1],

where x_[i] is the i-th largest element in the multiset {x₁, . . . , xₙ}. Write p = 1 − m/n for some m ∈ N. One can directly calculate ESp(X), which is the average of the largest m elements in the multiset {x₁, . . . , xₙ}, that is,

ESp(X) = (1/m)(x_[1] + · · · + x_[m]).

Since the sum of the largest m elements of {x₁ + y₁, . . . , xₙ + yₙ} is at most the sum of the largest m elements of {x₁, . . . , xₙ} plus that of {y₁, . . . , yₙ}, we obtain ESp(X + Y) ≤ ESp(X) + ESp(Y). (ii) Suppose that p is a rational number. Write p = k/m for k, m ∈ N and k < m. Note that p = (kn)/(mn) and (X, Y) is also mn-discrete. Therefore, from (i), we have that ESp(X + Y) ≤ ESp(X) + ESp(Y). (iii) For a general real number p, from the definition of ESp, it follows immediately that p ↦ ESp(X) is a continuous mapping. Therefore, we can find rational numbers p₁, p₂, . . . such that p_k → p as k → ∞ and, for k = 1, 2, . . . , ESp_k(X + Y) ≤ ESp_k(X) + ESp_k(Y). Taking a limit as k → ∞ we obtain ESp(X + Y) ≤ ESp(X) + ESp(Y).
Proof. It suffices to note that VaRq(X_k) ↑ VaRq(X) for almost every q ∈ (0, 1), a basic property of the quantile function; see Resnick [31, Proposition 0.1] for a proof. As ESp is an integral of VaRq, the Monotone Convergence Theorem implies ESp(X_k) ↑ ESp(X) as k → ∞.

Lemma 3.8. Suppose that the random variables X and Y are n-discrete for a positive integer n. Then for p ∈ (0, 1), ESp(X + Y) ≤ ESp(X) + ESp(Y).
Proof. Without loss of generality we can assume X, Y ≥ 0, since they are both bounded and ESp is translation-invariant. Denote by {(x₁, y₁), . . . , (x_m, y_m)} the range of (X, Y); obviously m ≤ n². Note that P((X, Y) = (x_i, y_i)) may not be a rational number for some i = 1, . . . , m, and hence Lemma 3.6 cannot be directly applied.
Since our probability space is atomless, we can construct, for each k ∈ N, a random vector (X_k, Y_k) whose probability mass function takes rational values and which approximates (X, Y) from below. It is clear that X_k ↑ X and Y_k ↑ Y in probability. Moreover, for each k ∈ N, (X_k, Y_k) is m_k-discrete for some m_k ∈ N, since the probability mass function of (X_k, Y_k) takes values in Q. By Lemma 3.6 and the monotonicity of ESp, we have ESp(X_k + Y_k) ≤ ESp(X_k) + ESp(Y_k) ≤ ESp(X) + ESp(Y). Since X_k + Y_k ↑ X + Y in probability as k → ∞, taking a limit in k and using Lemma 3.7, the above inequality yields ESp(X + Y) ≤ ESp(X) + ESp(Y).

Theorem 1.1, Proof 4. Let X_k and Y_k, k ∈ N, be discretizations of X and Y. It is obvious that X_k and Y_k are n-discrete for some n ∈ N, and X_k ↑ X and Y_k ↑ Y in probability. Lemma 3.8 yields ESp(X_k + Y_k) ≤ ESp(X_k) + ESp(Y_k). As a consequence, with Lemma 3.7, we obtain ESp(X + Y) ≤ ESp(X) + ESp(Y).

The merits of Proof 4 include: it is based on standard undergraduate-level techniques in probability theory, such as discrete approximation and convergence theorems; it reveals an intuitive explanation of ES being subadditive through discrete random variables; and it is easy to understand for students with good combinatorial and analytical skills.
We would recommend Proof 4 in a course where discretization of distribution functions or the continuity of risk measures is a point of interest, or the instructor intends to highlight intuition in the discrete case but not give a complete proof.
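The combinatorial heart of Proof 4 (case (i) of Lemma 3.6) can be checked directly for equally likely atoms (hypothetical code; the sample values are arbitrary):

```python
import numpy as np

def es_discrete(xs, p):
    # ES_p of an n-discrete variable taking each value in xs with prob 1/n,
    # for p = 1 - m/n: the average of the m largest values.
    n = len(xs)
    m = round(n * (1 - p))
    return float(np.mean(np.sort(xs)[-m:]))

rng = np.random.default_rng(3)
n, p = 8, 0.75                 # here m = 2
xs = rng.normal(size=n)        # (X, Y) takes value (xs[i], ys[i]) on cell i
ys = rng.normal(size=n)

lhs = es_discrete(xs + ys, p)
rhs = es_discrete(xs, p) + es_discrete(ys, p)
print(lhs <= rhs)  # the top-m average of sums never beats the sum of top-m averages
```

The inequality holds for any choice of the 2n numbers: the m cells where x_i + y_i is largest form one admissible selection of m cells for X and for Y separately.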

A proof based on the law of large numbers for order statistics
In this section, for a sequence of random variables X₁, X₂, . . . , we denote by X_[i,n] the i-th largest value in {X₁, . . . , Xₙ}, that is, the i-th order statistic of the first n observations.
For any two random variables X, Y ∈ L^∞, let (X₁, Y₁), (X₂, Y₂), . . . be a sequence of iid random vectors, identically distributed as (X, Y), and write Z_i = X_i + Y_i for i = 1, 2, . . . . Denote by A^n_m the set of m-tuples of distinct indices in {1, . . . , n}. Then

Σ_{i=1}^m Z_[i,n] = max{Z_{i_1} + · · · + Z_{i_m} : (i_1, . . . , i_m) ∈ A^n_m}
= max{X_{i_1} + · · · + X_{i_m} + Y_{i_1} + · · · + Y_{i_m} : (i_1, . . . , i_m) ∈ A^n_m}
≤ max{X_{i_1} + · · · + X_{i_m} + Y_{j_1} + · · · + Y_{j_m} : (i_1, . . . , i_m) ∈ A^n_m, (j_1, . . . , j_m) ∈ A^n_m}
= max{X_{i_1} + · · · + X_{i_m} : (i_1, . . . , i_m) ∈ A^n_m} + max{Y_{j_1} + · · · + Y_{j_m} : (j_1, . . . , j_m) ∈ A^n_m}
= Σ_{i=1}^m X_[i,n] + Σ_{i=1}^m Y_[i,n].

By setting m = ⌈n(1 − p)⌉ and dividing by m, we have (1/m) Σ_{i=1}^m Z_[i,n] ≤ (1/m) Σ_{i=1}^m X_[i,n] + (1/m) Σ_{i=1}^m Y_[i,n]. Taking n → ∞, by Lemma 3.9, we obtain ESp(X + Y) ≤ ESp(X) + ESp(Y).

The merits of Proof 5 include: it requires only the law of large numbers for order statistics (Lemma 3.9 above), which also gives a natural non-parametric estimator of ESp(X) in (3.4); in a context where statistical estimation of ESp is relevant, this proof fits in naturally. We would recommend Proof 5 in a course where statistical inference is a point of interest, or where the students have a solid statistical background. Proof 5 shares a similar argument with Proof 4. Whereas in Proof 4 the discrete case can be solved fairly elementarily, the general case needs a non-trivial probabilistic limit argument; for Proof 5, the general case can immediately be treated by a more powerful limit theorem from the theory of linear combinations of order statistics.
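The estimator appearing in Lemma 3.9, the average of the top order statistics, is easy to try out (hypothetical code; Exponential(1) is chosen because its ES is known in closed form, ES_p = −ln(1 − p) + 1):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.95
true_es = -np.log(1 - p) + 1.0   # ES_p of Exponential(1), approx. 3.996

for n in (1_000, 10_000, 100_000):
    sample = rng.exponential(1.0, n)
    m = int(np.ceil(n * (1 - p)))
    est = np.sort(sample)[-m:].mean()  # (1/m) * sum of the m largest order statistics
    print(n, est)                      # approaches true_es as n grows
```

This is exactly the non-parametric ES estimator referred to in (3.4): consistency of this average of order statistics is the content of the law of large numbers invoked in the proof.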

A proof based on convex order
For X, Y ∈ L^1, we say that X is smaller than Y in convex order, denoted by X ≺cx Y, if for all convex functions f,

E[f(X)] ≤ E[f(Y)], (3.5)

whenever both sides of (3.5) are well-defined.
Proof. This is a classic result in convex order; see Shaked and Shanthikumar.

Proof. This is a classic result on comonotonicity; see for instance Dhaene et al. [13, Theorem 7] or Rüschendorf [33, Theorem 3.5] for a proof.

Proof. This is another classic result on comonotonicity. First, note that for any non-decreasing function h and a U[0, 1]-distributed random variable U, VaRp(h(U)) = h(p). The claim then follows from a direct computation, in which the second-last equality comes from the fact that f(VaRu(Z)) = VaRu(f(Z)) for almost every u ∈ [0, 1], since f is non-decreasing.

Remark 3.6. The idea of this proof is presented in [36] and is also used in the review paper [14]. Lemma 3.10 dates back to [27] in the context of stochastic dominance. Lemma 3.11 was first shown in [30]. The fact that ESp is comonotone additive is part of the theory of Choquet integrals, and it can be found in [39] and [11]. The current form of Lemma 3.12 is given in Kusuoka [26, Proposition 20].
The merit of Proof 6 is that it naturally connects to the concepts of convex order, comonotonicity and comonotonic additivity, all of which are important modern concepts in quantitative risk management. This proof requires additional techniques in probability theory, and it may not be suitable for an audience without the corresponding knowledge.
We would recommend Proof 6 in an advanced course where convex order, comonotonicity and comonotonic additivity are points of interest or available as preliminaries.
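Comonotone additivity of ES (Lemma 3.12), a key step of Proof 6, can also be seen empirically: if X and Y are both non-decreasing functions of the same uniform U, their worst scenarios coincide (hypothetical code; the two transforms of U are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 0.95, 400_000
u = rng.uniform(size=n)

def es(sample, p):
    # Empirical ES_p: average of the largest ceil(n(1-p)) observations
    s = np.sort(sample)
    m = int(np.ceil(len(s) * (1 - p)))
    return s[-m:].mean()

# Comonotonic pair: both losses are non-decreasing functions of the same U
x = -np.log1p(-u)   # Exponential(1), i.e. F_X^{-1}(U)
y = u ** 2          # another non-decreasing transform of the same U
lhs = es(x + y, p)
rhs = es(x, p) + es(y, p)
print(lhs, rhs)     # comonotone additivity: the two agree
```

Because both coordinates increase with u, the largest values of x + y occur at exactly the scenarios where x and y are individually largest, so the empirical ES of the sum splits exactly.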

A proof based on the coherence representation of ES
This proof requires Lemma 2.3 and knowledge of Radon-Nikodym derivatives of probability measures. It is probably the most mathematically advanced among all proofs in this paper. For any ψ ∈ L^∞ with 0 ≤ ψ ≤ 1 and E^Q[ψ] ≤ α, by definition of ψ*, one has (ψ* − ψ)(ϕ − c) ≥ 0 almost surely; therefore ψ* attains the supremum.

Lemma 3.14. For p ∈ (0, 1), ESp has the representation

ESp(X) = sup{E^Q[X] : Q ∈ Qp}, X ∈ L^∞,

where Qp is the set of probability measures Q on (Ω, F) with Radon-Nikodym derivative dQ/dP ≤ 1/(1 − p).
Proof. Define a mapping ρ : L^∞ → R by ρ(X) = sup{E^Q[X] : Q ∈ Qp}. We aim to show that ESp = ρ. First, for X > 0, define a probability measure P̃ such that dP̃/dP = X/E[X]. Then, for X ∈ L^∞, by Lemma 3.13 the supremum defining ρ(X) is attained by the measure with density dQ/dP = W/(1 − p), where

W = I{X > VaRp(X)} + κ I{X = VaRp(X)}

and κ ∈ [0, 1] is such that E[W] = 1 − p; that is, κ = (P(X ≤ VaRp(X)) − p)/P(X = VaRp(X)) if P(X = VaRp(X)) > 0, and κ can take any value in [0, 1] if P(X = VaRp(X)) = 0. Therefore, ρ(X) = 1/(1 − p) E[XW], and from Lemma 2.3 we have ρ(X) = ESp(X). For arbitrary X ∈ L^∞, ρ(X) = ESp(X) follows from the above result by noting that both ρ and ESp are translation-invariant.

The merit of Proof 7 is that Lemma 3.14 reveals the coherence representation of ESp, a fundamental property connecting to the dual representation of coherent risk measures in its most general form. This proof requires additional techniques in modern probability theory and statistics, and it may not be suitable for an audience without the corresponding knowledge. A further merit is that the link between the Neyman-Pearson lemma and mathematical finance can also be found in the concept of quantile hedging in incomplete markets; see Föllmer and Schied [18, Section 8.1]. We would recommend Proof 7 in an advanced course where the axiomatic theory of coherent risk measures is a point of interest. Proof 7 can be viewed as a more comprehensive and advanced version of Proof 1.
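Lemma 3.14 can be illustrated on a three-point example (hypothetical code; the distribution and the randomly drawn candidate measures are arbitrary choices): the density from the proof attains the supremum, and no feasible density beats it.

```python
import numpy as np

p = 0.9
# A discrete loss: P(X=0)=0.50, P(X=1)=0.45, P(X=2)=0.05, VaR_0.9(X) = 1
values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.50, 0.45, 0.05])
cap = 1.0 / (1 - p)   # density bound dQ/dP <= 1/(1-p)

# Optimal density from the proof: W/(1-p), with
# W = I{X > VaR_p} + kappa * I{X = VaR_p} and E[W] = 1 - p
v = 1.0  # VaR_0.9(X)
kappa = (probs[values <= v].sum() - p) / probs[values == v].sum()
w = (values > v) + kappa * (values == v)
density = w / (1 - p)
assert np.all(density <= cap + 1e-12)
assert abs((density * probs).sum() - 1) < 1e-12   # Q is a probability measure
es_dual = (density * probs * values).sum()        # E^Q[X] at the optimizer

# Randomly drawn feasible densities never beat the optimizer
rng = np.random.default_rng(5)
best = 0.0
for _ in range(1000):
    d = rng.uniform(0, cap, size=3)
    d /= (d * probs).sum()            # normalize to a probability measure
    if np.all(d <= cap):              # keep only feasible candidates
        best = max(best, (d * probs * values).sum())
print(es_dual, best)  # es_dual dominates every feasible candidate
```

The optimal Q concentrates as much mass as the constraint allows on the scenarios where X is largest, which is exactly the stress-testing reading of the coherence representation.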

Overall comments
From the seven different proofs we offer in this paper, it becomes clear that the subadditivity question for ES is not a trivial one. It is both mathematically challenging and practically relevant. Each proof has its own merit and suitable context. In summary, Proof 1 is the shortest and it reveals a special form of the coherence representation of ES; Proofs 2 and 3 require the least knowledge of probability theory, with Proof 2 being shorter and connected to an optimization property of VaR and ES, and Proof 3 being convenient if only the case of continuous distributions is of interest; Proof 4 explains intuitively the subadditivity of ES in discrete cases and reveals a basic continuity of ES; Proofs 5, 6 and 7 require specialized knowledge, and they respectively reveal natural estimators of ES, its consistency with convex order, and the coherence representation of coherent risk measures. When teaching the subadditivity of ES in a course, the instructor is advised to choose a proof which fits best the knowledge of the audience and the content of the course.