Tight tail probability bounds for distribution-free decision making

Chebyshev's inequality provides an upper bound on the tail probability of a random variable based on its mean and variance. While tight, the inequality has been criticized for only being attained by pathological distributions that abuse the unboundedness of the underlying support and are not considered realistic in many applications. We provide alternative tight lower and upper bounds on the tail probability given a bounded support, mean and mean absolute deviation of the random variable. We obtain these bounds as exact solutions to semi-infinite linear programs. We leverage the bounds for distribution-free analysis of the newsvendor model, monopolistic pricing, and stop-loss reinsurance. We also exploit the bounds for safe approximations of sums of correlated random variables, and to find convex reformulations of single and joint ambiguous chance constraints that are ubiquitous in distributionally robust optimization.


Introduction
Chebyshev's inequality provides an upper bound on the tail probability of a random variable using only its first two moments (Bienaymé, 1853; Chebyshev, 1867). Due to this distribution-free nature, Chebyshev's inequality is widely applicable. Let the ambiguity set P (µ,σ) contain all distributions with a given mean µ and variance σ², and let the random variable X follow some distribution P ∈ P (µ,σ) . Chebyshev's inequality (the one-sided version also known as Cantelli's inequality) then follows from the worst-case distribution that solves the optimization problem
$$\sup_{P \in \mathcal{P}_{(\mu,\sigma)}} P(X \ge t) = \frac{\sigma^2}{\sigma^2 + (t-\mu)^2}, \qquad t > \mu. \tag{1}$$
This inequality is tight, meaning it cannot be improved in general. However, Chebyshev's inequality can be criticized for only being attained by pathological distributions that abuse the unboundedness of the underlying support. Indeed, the worst-case distribution takes values only on the points µ − σ²/(t − µ) and t (with probabilities (t − µ)²/(σ² + (t − µ)²) and σ²/(σ² + (t − µ)²), respectively), which can be regarded as unrealistic (Van Parys et al., 2016b). In many practical applications some information on the minimum and maximum of uncertain parameters is known. This is particularly true for OR applications that consider uncertain parameters that are known to be nonnegative, such as inventory management, service operations, appointment scheduling and pricing mechanisms. We remark that a tight tail probability bound under knowledge of the mean, variance and a bounded support was derived by De Schepper and Heijnen (1995). Next to restricting the support, a second potential improvement of Chebyshev's inequality concerns robustness to outliers. Whereas the (sample) variance is greatly influenced by outliers, the mean absolute deviation (MAD) is less sensitive to large deviations from the mean, and hence a potentially more robust measure of statistical dispersion in data. We therefore propose to replace the variance with the MAD. Using the MAD comes with additional advantages.
First, we show that the set of extremal distributions for which the derived tail bounds are tight is more varied than a single pathological distribution: it consists of an infinite number of mixed distributions instead. Second, because the MAD constraint is linear in the probability measure, it allows for elegant closed-form bounds, a feature we shall leverage when applying the bounds to domain-specific OR questions.
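The two-point worst-case distribution behind Cantelli's inequality can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the function name `cantelli_worst_case` and the parameter values are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

def cantelli_worst_case(mu, sigma2, t):
    """Two-point distribution described in the text for t > mu (illustrative helper)."""
    gap = t - mu
    p_t = sigma2 / (sigma2 + gap ** 2)   # probability mass on the point t
    low = mu - sigma2 / gap              # second support point, below mu
    return {low: 1 - p_t, t: p_t}

# Hypothetical parameter values chosen only for illustration.
mu, sigma2, t = Fraction(1), Fraction(1, 4), Fraction(2)
dist = cantelli_worst_case(mu, sigma2, t)

mean = sum(x * p for x, p in dist.items())
var = sum((x - mu) ** 2 * p for x, p in dist.items())
tail = sum(p for x, p in dist.items() if x >= t)
cantelli = sigma2 / (sigma2 + (t - mu) ** 2)
```

The distribution reproduces the prescribed mean and variance while its tail probability equals the bound, which is what tightness means here. Note that the lower support point moves further away from µ as t approaches µ, which is the behaviour a bounded support rules out.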
In obtaining the robust tail bounds, we need to solve
$$\sup_{P \in \mathcal{P}_{(\mu,b,d)}} P(X \ge t), \tag{2}$$
with P (µ,b,d) the ambiguity set that contains all distributions with a given mean µ, support [0, b] and mean absolute deviation d. Optimization problem (2) is a semi-infinite linear optimization problem (LP) that is reminiscent of those arising in moment problems, and such problems typically do not allow for an analytic (closed-form) solution. Using the MAD-based ambiguity set P (µ,b,d) , however, the dual program to (2) can be solved explicitly. While comparable dual programs are often solvable as semidefinite or second-order conic programs (see, e.g., Xin and Goldberg (2013); Natarajan and Zhou (2007); Perakis and Roels (2008); Natarajan et al. (2017); Das et al. (2018)), analytic solutions as in our case are typically hard to attain.
The solution of (2) gives a generic tight upper bound on the tail probability of all random variables with a given bounded support, mean and MAD. This new robust bound is of a similar simplicity and generality as the original Chebyshev inequality, and can therefore be used widely in various applications. The worst-case distribution that solves (2), is however more complicated than the two-point distribution of the Chebyshev inequality, and is a mixed distribution with up to three discrete parts and one continuous part. We also derive three more tail probability bounds: the tight lower bound under P (µ,b,d) ambiguity and the tight upper and lower bounds under P (µ,b,d,β) ambiguity, where we also condition on the skewness P (X ≥ µ) = β.
Recent advances in Distributionally Robust Optimization (DRO) also exploit ambiguity sets in terms of bounded support, mean and MAD to obtain closed-form expressions for stochastic quantities such as the minimum and maximum expectation of a convex function (Postek et al., 2018; Ghosal and Wiesemann, 2020). These closed-form expressions are then used to solve minmax and maxmin optimization problems that arise naturally in decision making under uncertainty. Postek et al. (2018) specifically use results from Ben-Tal and Hochman (1972) on tight upper and lower bounds on the expectation of a convex function of a random variable. This paper presents the first closed-form solution for the combination of P (µ,b,d) (or P (µ,b,d,β) ) constraints and a non-convex objective function. The proof method is not restricted to the indicator function, and could potentially work for a much larger class of (measurable) functions.
The first part of this paper revolves around the tail probability bounds. After the primal-dual proofs we conclude that part with extensive numerical demonstrations and a comparison with other classical bounds. We expect the bounds to be useful in many domains. The second part deals specifically with the use of these new bounds in the OR domain. Although numerical aspects are important, the focus is on utilizing the structural properties of the closed-form inequalities. For optimization problems that have tail probabilities as input, this can lead to closed-form or tractable solutions, which would remain out of reach without the derived tight inequalities.
We first apply the robust bounds for distribution-free analysis of three classical models that can be subjected to minmax or maxmin optimization. We start with the newsvendor model, the basic single-period inventory model that searches for the optimal order quantity in view of overage and underage costs. Under full information, the optimal order quantity corresponds to a specific quantile of the demand. Scarf (1958) studied the situation when only the mean and variance of demand are known, and derived a robust order quantity as the solution of a minmax optimization problem, where the decision maker takes the best decision under the worst possible circumstances in light of mean-variance ambiguity. Scarf's distribution-free analysis is one of the first forms of DRO, and has been a source of inspiration for many OR studies. Technically, it requires computing upper bounds via a linear program on the expected value of a convex function E[h(X)] for a random variable X with mean µ and variance σ 2 . We shall apply our tail probability bounds for a comparable analysis, with P (µ,b,d,β) ambiguity.
We then turn to the monopolistic pricing problem, where a seller seeks to maximize profit when selling a single object to a buyer who is willing to pay some unknown value X. Traditionally, it is assumed that X is drawn from some distribution that is known to the seller, so that the seller can set the optimal price. When there is a single buyer, the optimal strategy is to post a fixed price p that maximizes the expected profit pP(X > p); see Riley and Zeckhauser (1983) and Myerson (1981). We apply the derived tail probability bounds to the robust variant of the monopolistic pricing problem, where instead of knowing the distribution, the seller only has partial information contained in P (µ,b,d) . The seller then becomes a maxmin decision maker who chooses the price that maximizes the worst-case expected profit.
The third classical model occurs in stop-loss reinsurance. An insurance company faces a claim of size X, which it pays up to a predefined level z, while the reinsurance company covers the remainder (up to a predefined maximum m). We study this problem from both the insurer's and reinsurer's perspective, the latter of which requires an extension of our tail probability bound. Specifically, we derive an upper bound for the expected payment of the reinsurer, which is neither a convex nor an indicator-type function. The three classical models come from different corners of Operations Research and Management Science, but have in common that important characteristics can be expressed in distribution functions, which facilitates a direct application of the tail probability bounds for distribution-free analysis. The models also show that the bounds lend themselves to both minmax and maxmin decision problems. We emphasize that the models have been chosen somewhat arbitrarily, and there are many other OR questions where tail probability bounds under mean-MAD constraints can prove useful.
As alluded to above, our bounds are part of a larger research effort that deals with exploiting the tractability of mean-MAD constraints for enhancing the state-of-affairs in DRO. To highlight the connection of our work with DRO, we extend our tail probability bound to sums of random variables. We demonstrate the multivariate tail bound with a risk management example that considers portfolios with multiple risky assets. The bound allows for arbitrary dependence structures and hence is particularly suitable for, e.g., credit risk and insurance problems. Additionally, we apply our tail probability bounds to find convex reformulations of several types of ambiguous chance constraints containing a single random variable. Specifically, we consider a general convex constraint with right-hand side uncertainty, and a constraint that is bilinear in the decision variable and the uncertain parameter. This allows optimization problems with such ambiguous chance constraints to be solved through conventional solution methods from continuous optimization under mean-MAD ambiguity. These theoretical results on ambiguous chance constraints are illustrated through an example from radiotherapy optimization, in which the dose of radiation delivered to the tumor is to be maximized, under a probabilistic constraint on the dose of radiation delivered to the surrounding healthy tissue.
Outline and contributions. This introduction largely revolves around bounds of tail probabilities through mean-MAD ambiguity and their applications in probability theory, stochastic OR and optimization. For each of these applications, we will discuss more details and related studies in the appropriate sections. In Section 2 we present, prove and illustrate the tail probability inequalities. We provide tight upper and lower bounds for the probability that a random variable exceeds a specified threshold under a known support, mean and mean absolute deviation.
In Section 3 we use these bounds for distribution-free analysis of classic OR problems. Specifically, we study the newsvendor model in Section 3.1, the monopoly pricing model in Section 3.2, and the stop-loss reinsurance model in Section 3.3. To further demonstrate the large scope of our bounds, and to highlight the connection with DRO, we present in Section 4 two more perspectives. We extend our tail probability bound to sums of random variables in Section 4.1 and illustrate this extension in Section 4.2 with an insurance problem that considers a portfolio with multiple risks. Section 4.3 discusses the application of tail probability bounds to reformulate ambiguous chance constraints in distributionally robust optimization. This application is illustrated through an example from radiotherapy optimization in Section 4.4.

Novel tail probability bounds
In this section we derive novel bounds for the probability P(X ≥ t) that a random variable X with given support, mean and MAD exceeds t. We obtain the bounds by solving the semi-infinite linear program
$$\sup_{P \in \mathcal{P}_{(\mu,b,d)}} P(X \ge t), \tag{3}$$
where we maximize over a set of probability measures with the stated characteristics, i.e.,
$$\mathcal{P}_{(\mu,b,d)} = \left\{ P : P \text{ is a probability measure on } ([0,b],\mathcal{B}),\ \mathbb{E}_P[X] = \mu,\ \mathbb{E}_P|X-\mu| = d \right\},$$
with B the Borel σ-algebra of the closed set [0, b], and µ, b, d ∈ R + are parameters that describe all known properties of the distribution. We solve the linear programs and present the novel bounds in Section 2.1. We then compare the novel bounds with some existing bounds in Section 2.2, and briefly discuss the existing literature on generalized versions of Chebyshev's inequality.

Tight lower and upper bounds
Since P is a probability measure, it should satisfy the constraint ∫_{x∈[0,b]} dP(x) = 1. Moreover, this probability measure should satisfy the mean and MAD constraints ∫_{x∈[0,b]} x dP(x) = µ and ∫_{x∈[0,b]} |x − µ| dP(x) = d. Under these constraints, we solve the semi-infinite linear program (3), which gives our first main result.
Theorem 1. Consider a random variable X with a distribution P in P (µ,b,d) . Then,
$$\sup_{P \in \mathcal{P}_{(\mu,b,d)}} P(X \ge t) = \begin{cases} 1, & t \in [0, \tau_1], \\ \dfrac{\mu}{t} - \dfrac{d(b-t)}{2t(b-\mu)}, & t \in [\tau_1, \mu], \\ 1 - \dfrac{d}{2\mu}, & t \in [\mu, \tau_2], \\ \dfrac{d}{2(t-\mu)}, & t \in [\tau_2, b], \end{cases} \tag{5}$$
with τ 1 and τ 2 given by
$$\tau_1 = \mu - \frac{d(b-\mu)}{2(b-\mu) - d}, \qquad \tau_2 = \mu + \frac{d\mu}{2\mu - d}.$$
Proof. Let M + be the set of non-negative measures defined on the measurable space ([0, b], B). We need to solve
$$\sup_{P \in \mathcal{M}_+} \int_{[0,b]} 1\{x \ge t\}\, dP(x) \quad \text{s.t.} \quad \int_{[0,b]} dP(x) = 1,\ \int_{[0,b]} x\, dP(x) = \mu,\ \int_{[0,b]} |x-\mu|\, dP(x) = d. \tag{6}$$
A useful fact is that the semi-infinite LP (6) can be reduced to an equivalent finite LP that yields the same optimal value. In particular, when certain Slater conditions hold for the moment constraints (i.e., the moment vector should lie in the interior of the set of feasible moments), then solving the primal semi-infinite LP is equivalent to solving its finite dual counterpart; see, e.g., Isii (1962) or Popescu (2005). Moreover, the Richter-Rogosinski Theorem (see, e.g., Rogosinski (1958); Shapiro et al. (2009), or Han et al. (2015)) states that there exists an extremal distribution for problem (6) with at most three support points. While finding these points in closed form is typically not possible for general semi-infinite problems, we next show that this is possible for the problem at hand by resorting to the dual problem and exploiting the specific shape of the dual constraints that is imposed by the MAD constraint ∫ |x − µ| dP(x) = d.
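Consider the dual of (6),
$$\inf_{\lambda_0, \lambda_1, \lambda_2} \ \lambda_0 + \lambda_1 \mu + \lambda_2 d \quad \text{s.t.} \quad F(x) := \lambda_0 + \lambda_1 x + \lambda_2 |x - \mu| \ge 1\{x \ge t\}, \quad \forall x \in [0,b]. \tag{7}$$
The constraint of the dual problem requires F (x) to majorize 1{x ≥ t}. Note that F (x) has a 'kink' at x = µ, that is, F (x) is piecewise linear and can only change direction at x = µ. Solving (7) boils down to finding the tightest majorant. We have four candidates for the solution, which are depicted in Figure 1.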
Scenario 1a implies F (0) = 0, F (t) = F (b) = 1, which gives dual solution
$$\lambda_0 = \frac{\mu(b-t)}{2t(b-\mu)}, \qquad \lambda_1 = \frac{b + t - 2\mu}{2t(b-\mu)}, \qquad \lambda_2 = -\frac{b-t}{2t(b-\mu)},$$
and objective value
$$\lambda_0 + \lambda_1 \mu + \lambda_2 d = \frac{\mu}{t} - \frac{d(b-t)}{2t(b-\mu)}.$$
The next step of our proof is to find a feasible solution for the primal problem that yields the same objective value as the solution to the dual problem. By weak duality of semi-infinite linear programming, a feasible solution to the dual problem provides a valid upper bound on the optimal primal value. A feasible primal solution whose objective value equals this upper bound is therefore optimal and establishes strong duality. We next provide a constructive approach for finding such a primal solution.
Under strong duality, the primal maximizer P * and the dual minimizer (λ * 0 , λ * 1 , λ * 2 ) are related as
$$\int_{[0,b]} \left( F^*(x) - 1\{x \ge t\} \right) dP^*(x) = 0. \tag{10}$$
Moreover, due to dual feasibility we must have that F * (x) ≥ 1{x ≥ t} for all x ∈ [0, b]. This inequality combined with equation (10) is also known as the complementary slackness relation in (semi-infinite) linear programming. An immediate consequence of complementary slackness is that the worst-case probability distribution should be supported on the points where the dual solution function F * (x) = λ * 0 + λ * 1 x + λ * 2 |x − µ| coincides with the indicator function 1{x ≥ t}. For scenario 1a we have one (unique) option, that is, a discrete probability distribution with probability masses on the elements of the set {0, t, b}. The corresponding optimal probabilities of (6) follow from solving
$$p_0 + p_t + p_b = 1, \qquad t\, p_t + b\, p_b = \mu, \qquad \mu\, p_0 + (\mu - t)\, p_t + (b - \mu)\, p_b = d.$$
This gives
$$p_b = \frac{d}{2(b-\mu)}, \qquad p_0 = 1 - \frac{\mu}{t} + \frac{d(b-t)}{2t(b-\mu)}, \qquad p_t = 1 - p_0 - p_b,$$
and hence
$$p_t + p_b = \frac{\mu}{t} - \frac{d(b-t)}{2t(b-\mu)}.$$
Since the primal and dual objective values coincide, weak duality implies that these are the optimal solutions.
Scenario 1b implies F (0) = F (t) = F (b) = 1 and hence λ 0 = 1, λ 1 = λ 2 = 0 with objective value 1. One feasible primal solution is p b = (µ − t)/(b − t), p t = 1 − p b , with objective 1. Note that this primal solution is not a unique optimum, as the dual solution function F * 1b (x) coincides with 1{x ≥ t} on the entire interval [t, b]. Therefore, one could construct an arbitrary (discrete, continuous or mixed) probability distribution with support on the interval [t, b], which then serves as the worst-case distribution, as long as the mean and MAD conditions are satisfied.
Scenario 2a implies F (0) = F (µ) = 0, F (t) = 1, which gives
$$\lambda_0 = -\frac{\mu}{2(t-\mu)}, \qquad \lambda_1 = \lambda_2 = \frac{1}{2(t-\mu)},$$
and objective value d/(2(t − µ)). Solving the optimal probabilities of (6), where we take {0, µ, t} for the support of the worst-case distribution, indeed confirms that p t = d/(2(t − µ)).
Scenario 2b gives F (0) = 0, F (µ) = F (b) = 1, which results in
$$\lambda_0 = \frac{1}{2}, \qquad \lambda_1 = \frac{1}{2\mu}, \qquad \lambda_2 = -\frac{1}{2\mu},$$
and dual objective value 1 − d/(2µ). Solving (6) with support {0, t, b} confirms that p 0 = d/(2µ).
The proof is then completed by looking which scenario prevails on a specific interval, and these intervals can be determined by simply equating the minimum objective values and thereafter solving the resulting equations with respect to t to find τ 1 and τ 2 for, respectively, scenario 1 and 2.
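The primal-dual argument for scenario 1a can be sanity-checked numerically. The sketch below assumes the closed-form expressions that follow from the three conditions F(0) = 0, F(t) = F(b) = 1 and from the moment equations on the support {0, t, b}; the function name and the parameter values are illustrative assumptions.

```python
from fractions import Fraction as Fr

def scenario_1a(mu, b, d, t):
    """Primal masses on {0, t, b} and dual multipliers for scenario 1a (illustrative)."""
    # Primal: solve total mass, mean and MAD equations on the support {0, t, b}.
    p_b = d / (2 * (b - mu))
    p_0 = (d / 2 - (mu - t) * (1 - p_b)) / t
    p_t = 1 - p_0 - p_b
    # Dual: F(x) = lam0 + lam1*x + lam2*|x - mu| with F(0)=0, F(t)=F(b)=1.
    lam2 = (t - b) / (2 * t * (b - mu))
    lam1 = lam2 + 1 / t
    lam0 = -lam2 * mu
    return (p_0, p_t, p_b), (lam0, lam1, lam2)

# Hypothetical parameters with tau_1 = 1/3 <= t <= mu.
mu, b, d, t = Fr(1, 2), Fr(1), Fr(1, 4), Fr(2, 5)
(p_0, p_t, p_b), (lam0, lam1, lam2) = scenario_1a(mu, b, d, t)

def F(x):
    return lam0 + lam1 * x + lam2 * abs(x - mu)

primal_value = p_t + p_b                      # P(X >= t) under the three-point law
dual_value = lam0 + lam1 * mu + lam2 * d      # dual objective
mean = t * p_t + b * p_b
mad = mu * p_0 + (mu - t) * p_t + (b - mu) * p_b
# Dual feasibility on a grid: F must majorize the indicator of [t, b].
grid = [Fr(i, 200) * b for i in range(201)]
feasible = all(F(x) >= (1 if x >= t else 0) for x in grid)
```

Equal primal and dual values together with feasibility on both sides is exactly the weak-duality certificate of optimality used in the proof.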
We remark that the proof is identical for the strict inequality. Because the majorant is a continuous function, it is irrelevant whether the indicator function that is majorized is lower or upper semi-continuous.
We mention some noteworthy characteristics of the bound in Theorem 1. The bound is continuous at t = µ. If the support is symmetric around µ, then the worst-case probability is at least 1/2 for t ∈ [0, µ]. The upper bound for t ∈ [µ, b] is increasing in d for d ≤ 2µ(t − µ)/t and decreasing in d for larger values of d. This last observation in particular is interesting, as one might anticipate the bound to increase with MAD. This also implies that when MAD is unknown, the worst-case probability based on only the support and mean is given by the result of Theorem 1 for d = 2µ(t − µ)/t. This indeed returns Markov's inequality, P(X ≥ t) ≤ µ/t. We also mention that the support information [0, b] can easily be extended to [a, b] with a ∈ R by shifting the distribution accordingly. The tail bounds for the second and third interval then change into
$$\frac{\mu - a}{t - a} - \frac{d(b-t)}{2(t-a)(b-\mu)} \qquad \text{and} \qquad 1 - \frac{d}{2(\mu - a)},$$
respectively.
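The properties listed above are easy to explore numerically. The following sketch assumes the four-piece form of the upper bound implied by the scenarios in the proof of Theorem 1 (value 1, the scenario 1a value, 1 − d/(2µ), and d/(2(t − µ))); the function name `tail_upper_bound` and its optional argument `a` are illustrative assumptions.

```python
def tail_upper_bound(t, mu, b, d, a=0.0):
    """Sketch of sup P(X >= t) for support [a, b], mean mu and MAD d.

    Assumes 0 < d <= 2*(mu - a)*(b - mu)/(b - a), so all denominators are positive.
    """
    # Shift the support [a, b] to [0, b - a]; the MAD is unaffected by the shift.
    t, mu, b = t - a, mu - a, b - a
    tau1 = mu - d * (b - mu) / (2 * (b - mu) - d)
    tau2 = mu + d * mu / (2 * mu - d)
    if t <= tau1:
        return 1.0
    if t <= mu:
        return mu / t - d * (b - t) / (2 * t * (b - mu))
    if t <= tau2:
        return 1 - d / (2 * mu)
    return d / (2 * (t - mu))

# Hypothetical parameters: support [0, 1], mean 0.5, threshold t = 0.8.
mu, b, t = 0.5, 1.0, 0.8
d_star = 2 * mu * (t - mu) / t                 # MAD at which the bound peaks
markov = mu / t
ds = [i / 1000 for i in range(1, 500)]         # feasible MAD values in (0, 0.5)
worst_over_d = max(tail_upper_bound(t, mu, b, d) for d in ds)
```

In this example the maximum over d is attained at d = 0.375 and coincides with Markov's bound µ/t = 0.625, in line with the remark that the bound is not monotone in the MAD.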
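For a tight lower bound on P(X > t), we can use the results and the remark above on a slightly altered version of the input. The idea is formalized in the following theorem:
Theorem 2. Consider a random variable X with a distribution P in P (µ,b,d) . Then,
$$\inf_{P \in \mathcal{P}_{(\mu,b,d)}} P(X > t) = \begin{cases} 1 - \dfrac{d}{2(\mu - t)}, & t \in [0, \tau_1], \\ \dfrac{d}{2(b-\mu)}, & t \in [\tau_1, \mu], \\ \dfrac{dt - 2\mu(t-\mu)}{2\mu(b-t)}, & t \in [\mu, \tau_2], \\ 0, & t \in [\tau_2, b], \end{cases} \tag{19}$$
with τ 1 and τ 2 given by
$$\tau_1 = \mu - \frac{d(b-\mu)}{2(b-\mu) - d}, \qquad \tau_2 = \mu + \frac{d\mu}{2\mu - d}.$$
Proof. We reformulate the infimum as follows:
$$\inf_{P \in \mathcal{P}_{(\mu,b,d)}} P(X > t) = 1 - \sup_{P \in \mathcal{P}_{(\mu,b,d)}} P(X \le t) = 1 - \sup_{P \in \mathcal{P}_{(\mu,b,d)}} P(-X \ge -t),$$
where −X has support [−b, 0], mean −µ and MAD d. Plugging in the results from Theorem 1 for t ∈ (a, b] then yields (19). Similarly, the result for inf P∈P P(X ≥ t) can be obtained.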
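The reflection argument in the proof can be mirrored in code. The sketch below obtains the lower bound from the upper bound applied to b − X (equivalent to using −X after a shift), and checks that the Beta(2, 2) tail, whose mean is 0.5 and MAD is 0.1875, lies between the two bounds. The piecewise form of the upper bound is the one implied by the scenarios in the proof of Theorem 1; all function names are illustrative assumptions.

```python
def tail_upper_bound(t, mu, b, d):
    """Sketch of sup P(X >= t) on support [0, b] with mean mu and MAD d."""
    tau1 = mu - d * (b - mu) / (2 * (b - mu) - d)
    tau2 = mu + d * mu / (2 * mu - d)
    if t <= tau1:
        return 1.0
    if t <= mu:
        return mu / t - d * (b - t) / (2 * t * (b - mu))
    if t <= tau2:
        return 1 - d / (2 * mu)
    return d / (2 * (t - mu))

def tail_lower_bound(t, mu, b, d):
    """inf P(X > t) = 1 - sup P(Y >= b - t), where Y = b - X has mean b - mu and MAD d."""
    return 1 - tail_upper_bound(b - t, b - mu, b, d)

def beta22_tail(t):
    """P(X > t) for X ~ Beta(2, 2), whose CDF is 3t^2 - 2t^3 on [0, 1]."""
    return 1 - (3 * t ** 2 - 2 * t ** 3)

mu, b, d = 0.5, 1.0, 0.1875
grid = [i / 100 for i in range(1, 100)]
sandwiched = all(
    tail_lower_bound(t, mu, b, d) - 1e-12
    <= beta22_tail(t)
    <= tail_upper_bound(t, mu, b, d) + 1e-12
    for t in grid
)
```

Because the Beta(2, 2) distribution is continuous, P(X ≥ t) and P(X > t) coincide, so the same tail function can be compared with both bounds.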
We now describe in more detail the worst-case distributions that are revealed in the proof of Theorem 1.
Proposition 1. Consider the set of worst-case distributions P * = arg sup P∈P (µ,b,d) P(X ≥ t). Then P * consists of: (i) for t ∈ [0, τ 1 ], all distributions in P (µ,b,d) that are supported on [t, b]; (ii) for t ∈ [τ 1 , µ], the three-point distribution as derived in scenario 1a in the proof of Theorem 1; (iii) for t ∈ [µ, τ 2 ], all discrete/mixed distributions with probability mass d/(2µ) on 0 and the remainder of the probability mass supported on [t, b]; (iv) for t ∈ [τ 2 , b], all discrete/mixed distributions with probability mass d/(2(t − µ)) on t and the remainder of the probability mass supported on [0, µ].
Proof. The proof follows almost directly from the complementary slackness relation explained in the proof of Theorem 1. For t ∈ [0, τ 1 ] the dual solution function coincides with 1{x ≥ t} on the interval [t, b]. Hence, all distributions that are supported on this interval and obey the mean and MAD requirements are possible candidates for the worst-case distribution. Next, one can apply a similar reasoning for t ∈ [µ, τ 2 ] and t ∈ [τ 2 , b]. The worst-case distribution can exist on the range where the dual solution function F * (x) and the indicator function coincide. To attain the same optimal value, the probability mass on the singletons is chosen accordingly. Finally, note that the second case is already shown in the proof of Theorem 1.
Observe that when t equals τ 1 , µ, or τ 2 , there is only a single discrete extremal distribution. Figure 2 provides examples of the worst-case distributions for several different parameter settings and values of t. Proposition 1 shows that the ambiguity set P (µ,b,d) results in a non-trivial collection of worst-case distributions; that is, the mean-MAD approach results in a set that does not solely include discrete distributions with a small number of atoms for t ∉ [τ 1 , µ] ∪ {τ 2 }.
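The non-uniqueness of the worst case can be illustrated for a threshold t between µ and τ 2 . The sketch below assumes the description used in the proof: mass d/(2µ) on 0 and the remaining mass anywhere on [t, b], provided the mean constraint holds, which forces the conditional mean on [t, b] to equal τ 2 . It builds two different discrete members of this family; the helper names and parameter values are illustrative assumptions.

```python
from fractions import Fraction as Fr

def moments(dist, mu, t):
    """Return (total mass, mean, MAD, P(X >= t)) of a discrete distribution {x: p}."""
    total = sum(dist.values())
    mean = sum(x * p for x, p in dist.items())
    mad = sum(abs(x - mu) * p for x, p in dist.items())
    tail = sum(p for x, p in dist.items() if x >= t)
    return total, mean, mad, tail

# Hypothetical parameters with mu <= t <= tau2.
mu, b, d, t = Fr(1, 2), Fr(1), Fr(1, 4), Fr(3, 5)
p0 = d / (2 * mu)                       # mass on 0
tau2 = mu + d * mu / (2 * mu - d)       # conditional mean required on [t, b]

# Candidate (i): all remaining mass on the single point tau2.
dist_i = {Fr(0): p0, tau2: 1 - p0}

# Candidate (ii): remaining mass split over the endpoints t and b.
w_t = ((1 - p0) * b - mu) / (b - t)     # mass on t that matches the mean
dist_ii = {Fr(0): p0, t: w_t, b: 1 - p0 - w_t}

target = (Fr(1), mu, d, 1 - d / (2 * mu))
```

Both candidates satisfy the mean and MAD constraints and attain the same tail probability 1 − d/(2µ); any mixture of the two does so as well, which gives infinitely many extremal distributions.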
We next consider the tail bounds when also β = P(X ≥ µ) is known. We therefore consider the extended ambiguity set
$$\mathcal{P}_{(\mu,b,d,\beta)} = \left\{ P \in \mathcal{P}_{(\mu,b,d)} : P(X \ge \mu) = \beta \right\}.$$
Using this ambiguity set results in new tight bounds. These results are stated in the following two theorems, for which the primal-dual proofs are given in Appendix A.
Figure 2: Examples of the extremal distributions that attain the tail probability bound as described in Proposition 1.
Theorem 3. Consider a random variable X with a distribution P in P (µ,b,d,β) . Then, with τ 1 and τ 2 given by
Theorem 4. Consider a random variable X with a distribution P in P (µ,b,d,β) . Then, with τ 1 and τ 2 given by
Note that for these bounds equality between P(X ≥ t) and P(X > t) does not hold. In particular, the bounds admit a jump discontinuity at µ for all distributions with β = 1 − d/(2µ). In Figure 3 the upper and lower bounds are depicted for the ambiguity set that considers all distributions with µ = 0.5, d = 0.1875, β = 0.5, a = 0, and b = 1. As a point of reference, the Beta(2, 2) tail distribution, which is a member of the ambiguity set, is also plotted.

Comparison with other bounds
Closely related to our results is the discussion in Section 4.1 of Ghosal and Wiesemann (2020).
In particular, they consider, among others, an ambiguity set given by
$$\tilde{\mathcal{P}}_{(\mu,b,d)} = \left\{ P : P \text{ is a probability measure on } [0,b],\ \mathbb{E}_P[X] = \mu,\ \mathbb{E}_P|X-\mu| \le d \right\}.$$
The only difference with the ambiguity set we use is the inclusion of all distributions with a lower mean absolute deviation. This has major implications for the maximum and minimum probability to exceed t, however. First of all, it should be noted that the distribution with all its probability mass on µ is an element of P̃ (µ,b,d) for any value of d. This means that for any t < µ, the maximum probability of X exceeding t equals 1. Moreover, for any t > µ and d > 2µ(t − µ)/t, the maximum probability of X exceeding t is attained by a distribution with a mean absolute deviation equal to 2µ(t − µ)/t, which is explained by the observation that the bound we obtain is decreasing in d for d > 2µ(t − µ)/t. Clearly, because of the above observations, the theoretical maximum of P (X > t) has a much simpler closed-form solution than (5) for the ambiguity set P̃ (µ,b,d) . A big downside is that many of the extra distributions contained in P̃ (µ,b,d) but not in P (µ,b,d) might be unrealistic. Especially when the mean absolute deviation is known or can be accurately estimated, there is little reason to consider distributions with a different (in this case lower) mean absolute deviation. For large values of d relative to t in particular, using P̃ (µ,b,d) can lead to an overestimation of the maximum value of P (X > t). The observation that the maximum value of P (X > t) is decreasing in d for large values of d also means that considering distributions with a lower mean absolute deviation can lead to a higher bound on P (X > t).
Comparing the result of Theorem 1 to Cantelli's inequality (Chebyshev, 1867) is harder, since we assume the mean absolute deviation to be known, but not the variance. Hence, some relation between these two quantities is needed to be able to make a comparison. In particular, where β = P (X > µ) (Ben-Tal and Hochman, 1985). We note that this also implies d ≤ σ.
Throughout the comparison below we assume that d is given and compare the bound obtained in Theorem 1 with Cantelli's bound for different values of σ satisfying (23). Figure 4 illustrates this comparison for a simple numerical example with the following parameters: a = −1, µ = 0, b = 1, d = 1 4 . We consider three values for σ: σ = d = 1 4 , σ = 1 3 and σ = db 2 = 1 2 . Figure 4 gives rise to a number of interesting observations. First of all, we note that since Cantelli's bound is 1 for any t ≤ µ, the bound from Theorem 1 is at most Cantelli's bound as it includes an interval for which it is not 1. Furthermore, the flat area in the blue line corresponds to the values of t such that bound is lower than (5) for all τ * ≤ t ≤ b. This is true for all parameters as: In particular, for σ = d Cantelli's bound and (5) always coincide at t = µ + d, since: If, on the other hand, we choose σ = db 2 , its highest possible value, Cantelli's bound is higher than (5). This is true for all parameter values as well, as Cantelli's bound is increasing in σ and must thus be at least (5) for its highest possible value.
For intermediate values of σ, we observe behavior similar to the line corresponding to σ = 1 3 in Figure 4. More specifically, we find that (5) is lower than Cantelli's bound for all t in the two intervals [0,τ ] and[τ , τ ], with the three boundaries given bŷ Note that for some σ, such as σ = db 2 in Figure 4, it holds thatτ ≥ τ , that is, (5) is lower than Cantelli's bound for all t ∈ [µ, τ ]. To visually clarify all boundaries discussed above, Figure 5 only shows Cantelli's bound for σ = 0.27 and marks τ * ,τ , τ and τ .

Prior work on Chebyshev-type tail bounds
Multivariate generalizations of Chebyshev's inequality have also been studied. In Bertsimas and Popescu (2005) and Vandenberghe et al. (2007) generalizations are studied through formulating a convex optimization problem, given that the prescribed confidence region can be described by polynomial or linear and quadratic inequalities, respectively. In Grechuk et al. (2010), on the other hand, closed-form variants of Chebyshev's inequality are provided for dispersion measures other than the variance. Generalized versions of Chebyshev's inequality for products of random variables that focus on a one-sided inequality have also received some attention recently (Rujeerapaiboon et al., 2018). While Chebyshev's inequality is tight, it has been criticized for only being attained by pathological distributions that abuse the unboundedness of the underlying support and are not considered realistic in many applications (Van Parys et al., 2016b). A variant of the Chebyshev inequality that was already considered in Gauss (1821) restricts the distributions it considers to be unimodal. This yields an improvement by a factor 4/9 over the classical Chebyshev inequality. This idea of including unimodality has been extended to the multivariate case recently as well (Van Parys et al., 2016b).
All the above mentioned inequalities, however, still assume an unbounded support. De Schepper and Heijnen (1995) mention tail probability bounds that incorporate the upper bound of the random variable's range. A comparison is provided in Appendix B. Unfortunately, these bounds are only attained by rather pathological probability measures, i.e., point distributions with two or three atoms. Using the MAD instead of the variance results in a richer class of worst-case distributions.

Distribution-free analysis of OR models
We now turn to three classical OR models: the newsvendor problem, monopoly pricing and stop-loss reinsurance. These three models can be subjected to distribution-free analyses that make direct use of the novel Chebyshev bounds. This leads to closed-form solutions of the associated maxmin optimization. The common theme is that with ambiguity described in terms of mean, MAD and restricted support, distribution-free analysis leads to valuable structural insights, while unrestricted support often yields degenerate results.

Newsvendor problem
The newsvendor problem serves to find the order quantity that maximizes the expected profit for a single period given a stochastic demand. Denote by q the order quantity (number of units) and by D the stochastic demand during a single selling period. Per unit, p denotes the selling price and c the purchase cost. Let p > c, and assume without loss of generality that unsold units have zero salvage value, so that the profit is π(q, D) = p min{q, D} − cq. The decision maker then chooses the optimal order quantity q * that solves max q E P [π(q, D)].
This solution is known to be the η := (p − c)/p quantile (critical quantile) of the distribution of D, that is,
$$q^* = \inf\{ q : P(D \le q) \ge \eta \}. \tag{24}$$
In practice, however, the decision maker might only know partial information on the demand distribution. Scarf (1958) pioneered distribution-free analysis of the newsvendor problem, when only the mean and the variance of the demand are known. Scarf obtained the optimal order quantity for the worst-case demand, turning the newsvendor into a maxmin decision maker that solves
$$\max_{q} \ \inf_{P \in \mathcal{P}_{(\mu,\sigma)}} \mathbb{E}_P[\pi(q, D)], \tag{25}$$
with P (µ,σ) the ambiguity set that contains all distributions with a given mean µ and variance σ², and solution
$$q_S = \mu + \frac{\sigma}{2} \left( \sqrt{\frac{\eta}{1-\eta}} - \sqrt{\frac{1-\eta}{\eta}} \right). \tag{26}$$
We shall instead consider all demand distributions with given mean µ, MAD d and support [0, b], and consider
$$\max_{q} \ \inf_{P \in \mathcal{P}_{(\mu,d,b)}} \mathbb{E}_P[\pi(q, D)]. \tag{27}$$
This is the counterpart of problem (25). Scarf (1958) solved (25) directly, computing the lower bound inf P∈P (µ,σ) E P [π(q, D)] via a linear program. Instead, we do not solve (27) directly, but apply the robust Chebyshev bounds to the first-order condition for q * in (24). Clearly, tight lower and upper bounds for this quantile follow from inf P∈P (µ,d,b) P(D > q) and sup P∈P (µ,d,b) P(D > q), respectively, providing an interval that contains the optimal order quantity q * .
Theorem 5 (Order quantity bounds under mean-MAD-range ambiguity). Suppose the newsvendor knows the mean µ, the mean absolute deviation d and the upper bound b of the demand distribution P(D ≤ q). The optimal order quantity q * that solves max q E P [π(q, D)] is then contained The theorem provides various handles for a robust policy that responds to the uncertainty captured in P (µ,d,b) . The lower bound q L follows from the worst-case demand distribution.
Observe that q L is larger than µ when the profit margin η exceeds 1 − d/(2(b − µ)), and smaller than µ otherwise. This insight can be contrasted with q S in (26) that also considers the worst-case scenario, but then in view of P (µ,σ) ambiguity. Scarf's q S is larger than µ if η > 1/2 and smaller than µ otherwise. Hence, q L quantifies the dependency on b, whereas q S does not. In particular, when the profit margin η is fixed, the pessimistic newsvendor that uses q L will only order above the mean when b does not exceed µ + d/(2(1 − η)). Table 1 shows that the support [0, b] also influences the intervals [q L , q U ], in particular for low and high profit margins. We also recognize the three different regimes in Theorem 5 that correspond to low margins, average margins and high margins. We mention two further works related to Theorem 5. Ben-Tal and Hochman (1976) use general techniques for stochastic programs with limited information such as (27). For such stochastic programs the available information is often not sufficient to find the optimal solution.
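The interval [q L , q U ] can also be approximated numerically by inverting the tail bounds at the level 1 − η. The sketch below is not the closed form of Theorem 5; it assumes the piecewise upper bound implied by the scenarios in the proof of Theorem 1, obtains the lower bound by reflection, and uses bisection on these non-increasing functions. All function names and parameter values are illustrative assumptions.

```python
def tail_upper_bound(t, mu, b, d):
    """Sketch of sup P(D >= t) on support [0, b] with mean mu and MAD d."""
    tau1 = mu - d * (b - mu) / (2 * (b - mu) - d)
    tau2 = mu + d * mu / (2 * mu - d)
    if t <= tau1:
        return 1.0
    if t <= mu:
        return mu / t - d * (b - t) / (2 * t * (b - mu))
    if t <= tau2:
        return 1 - d / (2 * mu)
    return d / (2 * (t - mu))

def tail_lower_bound(t, mu, b, d):
    """Sketch of inf P(D > t), obtained by reflecting the demand around b."""
    return 1 - tail_upper_bound(b - t, b - mu, b, d)

def smallest_q(bound, level, b, tol=1e-10):
    """Smallest q in [0, b] with bound(q) <= level, for a non-increasing bound."""
    lo, hi = 0.0, b
    if bound(hi) > level:
        return b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bound(mid) <= level:
            hi = mid
        else:
            lo = mid
    return hi

def order_interval(eta, mu, b, d):
    """Interval [q_L, q_U] containing every eta-quantile compatible with the ambiguity set."""
    level = 1 - eta
    q_low = smallest_q(lambda q: tail_lower_bound(q, mu, b, d), level, b)
    q_up = smallest_q(lambda q: tail_upper_bound(q, mu, b, d), level, b)
    return q_low, q_up

# Hypothetical instance: demand on [0, 1], mean 0.5, MAD 0.25, margin eta = 0.5.
q_L, q_U = order_interval(0.5, 0.5, 1.0, 0.25)
```

For this instance the sketch returns an interval close to [0.25, 0.75]; since η = 0.5 is below 1 − d/(2(b − µ)) = 0.75, the lower end lies below the mean, consistent with the observation above.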
Ben-Tal and Hochman (1976) develop a method to construct the minimal set that should contain the optimum. They also demonstrate this technique for the newsvendor model with given mean and MAD, but unbounded support, and obtain intervals that indeed arise from Theorem 5 for the limit b → ∞: Natarajan et al. (2017) introduce semi-variance as an extra piece of information about the skewness of the distribution. Together with the mean and variance, this results in a more restrictive ambiguity set (compared to Scarf), and therefore a less conservative (or sharper) estimation of q * . Theorem 5 can also be viewed as a way to address conservatism, by taking into account the finite support. We could even restrict the ambiguity further by using the robust Chebyshev bound with the additional P(X ≥ µ) = β constraint, which like semi-variance measures skewness. We apply these bounds to the newsvendor model in Appendix C.
Apart from modifying or narrowing the ambiguity set, conservatism can be alleviated by in this paper can be used for distribution-free analysis of more advanced models, including those modeling regret and utility mentioned above, the risk-averse newsvendor with stochastic price-dependent demand (Chen et al., 2009) and multi-product settings (Choi et al., 2011).

Monopoly pricing
The monopoly pricing problem maximizes a seller's profit when selling a single object to a buyer who is willing to pay some unknown value B. Traditionally, it is assumed that B is drawn from some distribution P(B ≤ r) that is known to the seller, so that the seller can set the optimal price. When there is a single buyer, the optimal strategy is to post a fixed price r that maximizes the expected profit rP(B > r); see Riley and Zeckhauser (1983) and Myerson (1981). The seller thus faces the tradeoff between price and sale, because the probability of sale P(B > r) decreases with the price r.
We consider a robust variant of this model, where instead of knowing the distribution, the seller only knows that the distribution of B is contained in P (µ,d,b) and chooses the price that maximizes the worst-case expected profit: $\max_{r} \inf_{P \in \mathcal{P}_{(\mu,d,b)}} r\,P(B > r)$ (30). We refer to the solution of this maxmin optimization problem as the robustly optimal (or maxmin) price r L . The tight lower bound for rP(B > r), denoted by F (r), follows from the robust Chebyshev bound for the tail probability P(B > r) in (19). The resulting worst-case profit function turns out not to be concave and in fact has multiple local maxima, as illustrated in Figure 6. Upon maximizing this worst-case profit function, the next result shows that there exist three ranges of dispersion (measured in MAD), each with a different optimal price that attains the largest local maximum.
Theorem 6 (Optimal price under mean-MAD-range ambiguity; Van Eijk and Van Leeuwaarden (2020)). Suppose the seller knows the mean µ, the mean absolute deviation d and the upper bound b of the value distribution. For b ∈ [µ, 5µ], the solution to the monopoly pricing problem (30) is given by where For b > 5µ and d ∈ [0, d 2 ] the optimal price is r L = r 1 . For b > 5µ and d ∈ (d 2 , d max ] the optimal price r L is either r 1 or r 2 .
The proof of Theorem 6 is due to Van Eijk and Van Leeuwaarden (2020) and is also presented in Appendix E. Theorem 6 shows that the pricing function r L is not monotone in the dispersion (measured in MAD), and that dispersion and support both have a major influence on the pricing strategy. The theorem identifies the dispersion thresholds d 1 and d 2 . As a function of dispersion, the price is smaller than µ and decreasing until d 1 , then equals µ until d 2 , and then increases towards the maximal value b. Hence, only when dispersion is high is the seller willing to set a price close to the maximum price b. This is illustrated in Figure 7. Theorem 6 degenerates when the support exceeds 5 times µ. In particular, when b → ∞, the optimal price is µ − √(dµ/2) for all d ≤ 2µ. This price is always lower than µ. Theorem 6 contributes to the active research field of robust pricing; see Carroll (2019) for a recent overview. We mention a few related works. Kos and Messner (2015) show that when the seller only knows the mean valuation, the maxmin profit is always zero. This solidified the intuition that in the absence of an upper bound, arbitrarily high valuations cause overly pessimistic scenarios.
Indeed, Kos and Messner (2015) show that when there is an upper bound, the seller can find a nontrivial maxmin price that is smaller than µ and generates positive expected profit. The same holds true when the seller knows the variance (instead of upper bound); see Carrasco et al. (2018). In both cases, the maxmin price monotonically decreases with the allowed dispersion (either measured as upper bound or variance). Suzdaltsev (2018) considers the case when the mean, variance and upper bound are all three known, and finds a maxmin price that is smaller than µ for low variance and greater than µ for high variance.
The monopolistic pricing problem is connected to virtual valuations v(r) := r − P(B > r)/f B (r), with f B (r) the probability density function of B. These virtual valuations measure the surplus that can be extracted from agents, and can be used for optimal design of auctions with multiple buyers or object-types (Myerson, 1981). With a single buyer, the optimal price is the solution to v(r) = 0, which indeed is arg max r rP(B > r). Hence, there are possibilities for deploying the robust Chebyshev bounds for other models in pricing and mechanism design, for instance distribution-free analysis of auctions with multiple independent bids (Suzdaltsev, 2018) or correlated bids (Che, 2019).
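As a small illustration of the known-distribution benchmark, suppose that B is uniform on [0, b]. This distribution, the function names and the grid search are illustrative assumptions. In that case P(B > r) = 1 − r/b and the density is 1/b, so that v(r) = 2r − b, and the price solving v(r) = 0 is b/2. The sketch below checks that a grid search over rP(B > r) returns the same price.

```python
def tail_uniform(r: float, b: float) -> float:
    """P(B > r) for B uniform on [0, b] (an assumed example distribution)."""
    return min(1.0, max(0.0, 1.0 - r / b))


def expected_profit(r: float, b: float) -> float:
    """Expected profit r * P(B > r) of posting price r."""
    return r * tail_uniform(r, b)


def virtual_valuation(r: float, b: float) -> float:
    """v(r) = r - P(B > r) / f_B(r), with density f_B(r) = 1/b on [0, b]."""
    return r - tail_uniform(r, b) * b


def grid_argmax_price(b: float, steps: int = 1000) -> float:
    """Price on an equidistant grid of [0, b] that maximizes the expected profit."""
    grid = [b * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda r: expected_profit(r, b))


if __name__ == "__main__":
    b = 4.0  # hypothetical upper bound on the valuation
    best = grid_argmax_price(b)
    print(best, expected_profit(best, b), virtual_valuation(best, b))
```

The robust problem replaces the known tail probability in `expected_profit` by its tight lower bound, which is what makes the worst-case profit function non-concave.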

Stop-loss reinsurance
Reinsurance is a classical topic in the actuarial sciences and insurance mathematics and implies that an insurance company transfers part of its risk to a reinsurance company; see e.g., Asmussen and Albrecher (2010), Kaas et al. (2008). Say an insurance company faces a total claim S that is the sum of n individual claims X i , i = 1, . . . , n. The insurance company pays the claim up to a level z, and the reinsurance company covers the remainder. This gives rise to the so-called retention function ψ(z, S) = min{S, z} that represents the payment of the insurer. We provide an upper bound for the standard stop-loss retention function in Appendix D.
The payment function of the reinsurance company poses a more challenging problem when the insurance coverage is limited. In this case, a relevant performance characteristic is to what extent the insurance company benefits from the reinsurance contract. This benefit is measured with the function φ(z, S) = min{max{S − z, 0}, m}. When the total claim S stays below the retention limit z, the insurance company covers the entire claim, but when S exceeds z the reinsurer pays the excess claim up to a maximum m.
Thus, the reinsurance company does not compensate large claims that exceed the exit point m + z. Above this level the risk is retained by the insurance company.
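The payment functions of this section are easy to state in code. The sketch below assumes that the limited reinsurance payment described above takes the capped form min{max{S − z, 0}, m}. This is our reading of the verbal description, and the function names are hypothetical. It also uses the identity min{S, z} = S − max{S − z, 0} for the retention function.

```python
def retained_payment(z: float, s: float) -> float:
    """Retention function psi(z, S) = min{S, z}: the insurer pays up to level z."""
    return min(s, z)


def stop_loss_excess(z: float, s: float) -> float:
    """Unlimited stop-loss payment max{S - z, 0}."""
    return max(s - z, 0.0)


def limited_reinsurance_payment(z: float, s: float, m: float) -> float:
    """Assumed capped form min{max{S - z, 0}, m}: excess over z, paid up to m."""
    return min(stop_loss_excess(z, s), m)


if __name__ == "__main__":
    z, m = 4.0, 3.0  # hypothetical retention limit and maximum reinsurance payment
    for s in (2.0, 6.0, 10.0):
        # Below z nothing is reimbursed; above the exit point z + m the cap binds.
        print(s, retained_payment(z, s), limited_reinsurance_payment(z, s, m))
```

For claims beyond the exit point z + m the reimbursement stays at m, so the remaining risk is retained by the insurer.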
We obtain a novel bound by using primal-dual arguments.
Theorem 7. The expected insurer's benefit is bounded by where the function φ(z, S) degenerates to max{S − z, 0} if z + m > b. In this case, Proof. See Appendix F.
An illustration of the bounds for the stop-loss payments is provided in Figure 8, where we display payments as functions of z with µ = 5, d = 1.77, and m = 3, m = 5 and m → ∞. We assume that the 'true' total claim S follows a Poisson(5) distribution. These results complement the literature on tight bounds for expected claim payments. Cox (1991), considering bounded support and known first and second moment, obtains tight bounds using general results for moment problems. Other related works explore ways to sharpen the bounds using additional information. When modifying the ambiguity set by incorporating skewness information, imposing unimodality and symmetry conditions, or using higher order moments, the gap between the upper and lower bounds narrows significantly; see Heijnen (1990), De Vylder and Goovaerts (1982), and Jansen et al. (1986). Note that the mean-MAD information can easily be extended with skewness parameters, such as the probability β = P(S ≥ µ) or the median. In our case it is also possible to impose unimodality and symmetry conditions by altering the dual problem. Section 4 of Popescu (2005) discusses these modifications for general piecewise polynomial functions in the constraints of the dual problem. We discuss one such extension in the next section: the multivariate stop-loss reinsurance problem.

More applications for sums and optimization
As alluded to in the introduction, the novel tail bounds are part of a much larger research effort within the area of distributionally robust optimization (DRO), trying to exploit the tractability that comes with mean-MAD ambiguity constraints. To show this connection with DRO, we first extend our tail probability bound to sums of random variables (that arise often in DRO applications) in Section 4.1 and illustrate the effectiveness with an insurance example in Section 4.2. Section 4.3 then discusses the application of tail probability bounds to reformulate ambiguous chance constraints in DRO. In Section 4.4 we provide a realistic DRO example that arises in radiotherapy optimization.

Sums of random variables
Widely used in probability theory and stochastic OR, sums of random variables find application in areas such as inventory management, service operations management, mathematical finance and credit risk. Mathematical techniques for sums of random variables are covered in many standard texts on probability theory (Chung, 2001;Feller, 1971). For sums of i.i.d. random variables, variance enters naturally (e.g., variance of the sum, central limit theorem), also for deriving tail bounds. The MAD of i.i.d. variables, on the other hand, cannot simply be summed. Leveraging the tight univariate tail bound, we establish a generic multivariate tail bound for sums of random variables. We do so without a specific application in mind, but with the goal of deriving broadly applicable distribution-free bounds.
Consider n random variables X 1 , . . . , X n with known support, mean and MAD, and consider the worst-case tail probability with F the multi-dimensional ambiguity set, i.e., and B n the n-dimensional Borel σ-algebra. Note that we do not make any assumptions with regard to (in)dependence or correlation between the random variables. To analyze (35) we define a new random variable Y = Σ_{i=1}^n X i . Clearly, the support and mean of Y follow from (36): Unfortunately, the mean absolute deviation is unknown and applying Theorem 1 is thus not straightforward. To ease notation, we shall denote the sums of µ i , b i and d i by µ̄, b̄ and d̄, respectively. Theorem 8 presents a bound on (35) for any upper bound on the mean absolute deviation of Y .
Proof. We consider the following ambiguity set for the distribution of Y : It should be noted that (38) is not an ambiguity set of the form (4), because of the inequality for the mean absolute deviation. This also means that the tightest upper bound this approach can obtain for t ∈ [0, µ̄] is 1, as the distribution with probability mass 1 on µ̄ is an element of G. For t ∈ (µ̄, b̄], we can however infer from Theorem 1 that We will now simplify this expression by solving the maximization problem over d explicitly. To that end, we first note that the minimum is taken over two linear functions of d, an increasing and a decreasing one. Therefore, the unconstrained maximum is at the intersection of these functions, and the optimal d is thus either the upper bound d̂ on the mean absolute deviation or d * , where d * is such that

Solving this equation yields
We remark that the upper bound d̂ is the optimal solution when as this means that d̂ < d * . Therefore, we find that The most obvious candidate for d̂ is given by the sum of all mean absolute deviations d̄, which is clearly an upper bound as (Postek et al., 2018): E|Y − µ̄| ≤ Σ_{i=1}^n E|X i − µ i | = d̄. This bound, however, is generally not tight. It is tight, for example, when µ i , b i and d i are equal for all i = 1, . . . , n. We will use this bound in the remainder of this section. Several other possible bounds that can be used are described by Postek et al. (2018).
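A small numerical check illustrates why the sum of the individual MADs is an upper bound on the MAD of the sum, and when it is attained. The two-point marginal on {0, 2} and the helper names below are hypothetical choices made for this illustration only.

```python
from itertools import product


def mad(values, probs):
    """Mean absolute deviation E|X - E[X]| of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(abs(v - mean) * p for v, p in zip(values, probs))


def independent_sum(marginals):
    """Distribution of X_1 + ... + X_n for independent discrete marginals.

    Each marginal is a (values, probs) pair; the result is a (values, probs) pair.
    """
    dist = {}
    for combo in product(*(list(zip(v, p)) for v, p in marginals)):
        total = sum(v for v, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist[total] = dist.get(total, 0.0) + prob
    values = sorted(dist)
    return values, [dist[v] for v in values]


# Hypothetical two-point marginal on {0, 2}: mean 1 and MAD 1.
marginal = ([0.0, 2.0], [0.5, 0.5])
sum_of_mads = 2 * mad(*marginal)

# Independent copies: the MAD of the sum lies strictly below the sum of MADs.
mad_independent = mad(*independent_sum([marginal, marginal]))

# Comonotonic copies (X_2 = X_1): the sum equals 2 * X_1 and the bound is attained.
mad_comonotonic = mad([0.0, 4.0], [0.5, 0.5])

print(sum_of_mads, mad_independent, mad_comonotonic)  # prints 2.0 1.0 2.0
```

In this example the bound is loose under independence and attained under comonotonicity, which is consistent with using the sum of the MADs as a dependence-free upper bound.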
The allowed correlation structure is convenient in many situations. Take for instance the portfolio loss in credit risk, traditionally modeled as L n = Σ_{i=1}^n X i , with X 1 , . . . , X n the losses (due to default) of the individual obligors (Glasserman and Li, 2005). The standard scenario in credit risk is that losses are positively correlated, allowing L n to assume relatively large values, which can be measured in terms of Value-at-Risk, the α-quantile of the loss distribution, i.e., VaR α := inf{t ≥ 0 : P[L n ≤ t] ≥ α}. The multivariate tail bound can be translated directly into bounds for Value-at-Risk.
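The Value-at-Risk definition above can be made concrete for a discrete loss distribution. The sketch below is a minimal illustration under that assumption. The function name, the tolerance and the three-point loss distribution are hypothetical.

```python
def value_at_risk(values, probs, alpha):
    """VaR_alpha = inf{t >= 0 : P[L <= t] >= alpha} for a discrete loss L."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    cumulative = 0.0
    for value, prob in sorted(zip(values, probs)):
        cumulative += prob
        # Small tolerance guards against floating-point round-off in the CDF.
        if cumulative >= alpha - 1e-12:
            # The infimum runs over t >= 0, hence the truncation at zero.
            return max(value, 0.0)
    raise ValueError("probabilities must sum to at least alpha")


if __name__ == "__main__":
    losses = [0.0, 2.0, 4.0]      # hypothetical portfolio loss levels
    weights = [0.25, 0.5, 0.25]   # and their probabilities
    for alpha in (0.25, 0.5, 0.9):
        print(alpha, value_at_risk(losses, weights, alpha))
```

Since P[L n ≥ t] ≤ 1 − α implies P[L n ≤ t] ≥ α for such t, any upper bound on the tail probability that drops to 1 − α at some level t yields t as an upper bound on VaR α.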

Insurance portfolio example
Consider an insurer that holds a portfolio that can incur random losses X 1 , ..., X n , which correspond to different types of insurance claims. The insurer considers the cumulative value of the claims X 1 , ..., X n and the probability that this value exceeds the available capital t; that is, the insurer is interested in the ruin probability, which is given by where the distribution P lies in F. Similar to the portfolio loss in credit risk, the allowed dependence structure proves useful when considering insurance problems with catastrophic risks. variables with mean-MAD information, we can provide a conservative bound for the probability of the event that the total claim exceeds t. In Figure 9 we show the mean-MAD bound and the actual ruin probabilities for two different dependence structures, i.e., the independence and comonotonic copula which couple the X i , i = 1, 2, 3. Sums of random variables are known to be the 'riskiest' when they are comonotonic; that is, the terms grow simultaneously and hence a component can in no way hedge another one. The joint cumulative distribution of X 1 , . . . , X n attains the so-called Fréchet-Hoeffding upper bound (Kaas et al., 2008):
where Y i ∼ X i , i = 1, . . . , n, which entails that these are, in a certain sense, the most related variables. Notice that our novel conservative tail bound holds for all dependence structures, and it also covers heavy-tailed distributions with finite MAD. This is shown in Figure 10. Estimating dependence between insurance claims is often practically infeasible with only a limited amount of data; see McNeil et al. (2015) for a comprehensive discussion on this topic. It is feasible, however, to estimate the mean and MAD accurately even when data is scarce.
In the previous section we assumed that the distributional parameters of S are known. In practice, however, the insurer relies on claim data of individual contracts. The total loss often consists of several different types of insurance contracts, e.g., the three claim types mentioned in the example above. Assume next that we have exact knowledge of the mean µ i , MAD d i , and support of each of these losses X 1 , . . . , X n . We are now interested in evaluating Note that we again do not make any assumptions with respect to the dependence structure of the random variables. To evaluate (40) we sum the individual losses: S n := Σ_{i=1}^n X i . The support and mean of the aggregate risk S n are given by The mean absolute deviation is again unknown, and hence we follow an approach similar to the derivation of the multi-dimensional tail probability bound. We also adopt the same notation.
Theorem 9 presents an upper bound of (40) for any upper bound on the mean absolute deviation of S n .
Theorem 9. For any d̂ such that
Proof. See Appendix F.
Figure 10 displays our bound applied to a stop-loss reinsurance contract. As before, we model the three claims as random variables X i , i = 1, 2, 3, for which the insurance company only has information regarding the support, mean, and MAD. For the sake of comparison, assume that the 'true' losses are lognormally distributed with the aforementioned parameter values and that the insurer has reliably estimated the mean and MAD. Using the bounds for sums of random variables with mean-MAD information, we can provide a conservative bound for the expected stop-loss payment given the retention limit z. In Figure 10 we show the mean-MAD bound and the 'true' expected claim payments for two different dependency structures. We again consider the independence and comonotonic copulas as the underlying structures coupling the losses X i , i = 1, 2, 3.

Ambiguous chance constraints
A large class of decision problems in OR can be formulated as optimization problems of the form for some convex functions f and g 1 , . . . , g m . Here, x denotes the decision variable, while Z is some given parameter. In many applications, Z is uncertain and the constraints are often replaced by chance constraints, i.e., for some accepted risk level ε ∈ (0, 1), it is instead required for each i = 1, . . . , m. This type of chance constraint is referred to as a single chance constraint, while a single probabilistic constraint on all constraints being satisfied simultaneously is known as a joint chance constraint. Examples of such applications include, but are not limited to, finance (Dert and Oldenkamp, 2000), network design (Wang, 2007) and call-center staffing (Gurvich et al., 2010). Chance constraints suffer from tractability issues, however, and additionally require an exact specification of the distribution. Recently, therefore, there has been an emerging interest in ambiguous chance constraints, in which the distribution of the uncertain parameters is not fully specified. A single ambiguous chance constraint takes the form for some ambiguity set P. The primary goals in analyzing such constraints are (i) determining under which conditions (42) defines a convex set in x and (ii) finding a representation and/or approximation of this set in terms of simple convex inequalities. Often, such conditions and representations are prohibitively hard to find when considering joint ambiguous chance constraints.
We therefore confine our discussion to single ambiguous chance constraints.
One way to find convex reformulations of single ambiguous chance constraints is the use of classical probability inequalities. Hoeffding's inequality, for example, has been used by Ben-Tal and Nemirovski (2000) and Bertsimas and Popescu (2005) to derive approximations and reformulations of (42), respectively, when g is linear and the components of Z are independent, symmetric, and bounded. Building on that, Bertsimas et al. (2019) use sub-Gaussian theory to derive safe approximations under the same assumptions on Z, and the relaxed assumption that g is concave in Z. Similarly, a generalized Chebyshev inequality is used by Xu et al. (2012) to find convex reformulations, while Nemirovski and Shapiro (2007) and Postek et al. (2018) use Bernstein bounds to derive convex approximations. The tail probability bounds we derive also allow for such methods to be applied. We discuss convex reformulations of single chance constraints in which the uncertainty is present as a single random variable. Specifically, we consider the case where m = 1, i.e., there is only a single uncertain parameter Z, and use our main result from Theorem 1 to reformulate the semi-infinite constraint (42) into a convex constraint for certain forms of g.
We first present a convex reformulation of an ambiguous chance constraint when g is convex in x and affine in Z. This case is often referred to as right-hand side uncertainty.
Theorem 10. Let g̃ : R n → R and let Z be a 1-dimensional random variable whose distribution lies in the ambiguity set for some d ∈ [0, 1]. For any ε ∈ (0, 1/2) and x ∈ R n it holds that if and only if g̃(x) ≤ −min{1, d/(2ε)}. (44)
Proof. We first rewrite (43) to From Theorem 1 and the fact that ε < 1/2 we know that it must hold that −g̃(x) > E[Z] = 0. Given that requirement, we know by Theorem 1 that From d ∈ [0, 1] and ε ∈ (0, 1/2), it follows that 1 − d/2 > ε, and thus any feasible solution x must satisfy −g̃(x) ≥ 1 and/or d/(−2g̃(x)) ≤ ε. The latter can be equivalently stated as g̃(x) ≤ −d/(2ε), which can easily be combined with the former as g̃(x) ≤ −min{1, d/(2ε)}. Because d/(2ε) > 0, we find that the requirement −g̃(x) > 0 is redundant, and thus (43) is equivalent to (44).
We note that assuming d ∈ [0, 1] is equivalent to assuming that P (µ,b,d) is nonempty. Moreover, we note that we cannot assume a support of [−1, 1] and a mean of 0 without loss of generality, as this implies that the support of Z is symmetric around the mean. It is straightforward, however, to extend our results to an ambiguity set with support [−1, u] for some u > 0, which can be assumed without loss of generality.
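The feasibility test implied by the proof of Theorem 10 can be sketched in a few lines. The sketch assumes that the two conditions in the proof read −g̃(x) ≥ 1 or d/(−2g̃(x)) ≤ ε, so that together they amount to g̃(x) ≤ −min{1, d/(2ε)}. The function names are hypothetical, and d = 0 is excluded to match the step in the proof that uses d/(2ε) > 0.

```python
def _check_parameters(d: float, eps: float) -> None:
    """Validate the assumed parameter ranges: 0 < d <= 1 and 0 < eps < 1/2."""
    if not 0.0 < d <= 1.0:
        raise ValueError("requires 0 < d <= 1")
    if not 0.0 < eps < 0.5:
        raise ValueError("requires 0 < eps < 1/2")


def is_feasible(g_value: float, d: float, eps: float) -> bool:
    """Check the two conditions from the proof for a given value of g(x)."""
    _check_parameters(d, eps)
    if g_value >= 0.0:
        return False  # the proof requires -g(x) > E[Z] = 0
    t = -g_value
    return t >= 1.0 or d / (2.0 * t) <= eps


def g_upper_limit(d: float, eps: float) -> float:
    """Largest admissible value of g(x) under the assumed reading: -min{1, d / (2 eps)}."""
    _check_parameters(d, eps)
    return -min(1.0, d / (2.0 * eps))


if __name__ == "__main__":
    d = 0.2  # hypothetical MAD on the support [-1, 1]
    for eps in (0.05, 0.25):
        print(eps, g_upper_limit(d, eps))
```

For small ε the support information takes over: the limit stays at −1 rather than decreasing like −d/(2ε), which is where the bounded support reduces conservatism.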
A similar reasoning to that in Theorem 10 can be applied to joint chance constraints with independent right-hand side uncertainty. Here, because of the use of the support information of the random variable, we do not provide an exact reformulation, but a safe approximation instead. Providing a tractable reformulation when uncertainty is present in the left-hand side, on the other hand, is significantly more complicated. Using our results, we provide a reformulation when g is bilinear in x and Z. For the sake of conciseness, the first two results as well as the proof of the result below are included in Appendix G.
Theorem 11. Let ā, â ∈ R n , h ∈ R and Z ∈ R be a random variable whose distribution lies in the ambiguity set for some d ∈ [0, 1]. For any ε ∈ (0, 1/2) and x ∈ R n it holds that if and only if ā Observe that (46) has a linear representation, and optimization problems containing ambiguous chance constraints of the form (45) can thus be solved very efficiently. Also observe that our results extend to any constraint that consists of a bilinear term in x and Z and any other convex term independent of Z.
For the sake of conciseness, we only presented convex reformulations for two types of ambiguous chance constraints in this section. The tail probability bound derived in Theorem 1 can be applied to derive convex reformulations and safe approximations to other ambiguous chance constraints as well.
We mention two related works to our discussion on ambiguous chance constraints. Hanasusanto et al. (2017) present a tractable framework for joint ambiguous chance constraints under a few simplifying conditions. In particular, they assume a conic, hence unbounded, support, which is a key difference to our approach. Their approach is very powerful in settings for which an unbounded support makes sense, however, as they are able to elegantly deal with joint ambiguous chance constraints as well. Xie and Ahmed (2018), on the other hand, consider ambiguous chance constraints given a bounded support and moment information. Their assumptions on the ambiguity set do, however, exclude exact distributional information on nonlinear functions of the uncertain parameter, which we do assume in exact knowledge of the mean absolute deviation.

Optimization problem from radiotherapy
We now illustrate these implications by applying our result to an optimization problem that arises in radiotherapy. Here, the biological effective radiation dose delivered to a tumor is to be maximized subject to a constraint on the biological effective dose delivered to the surrounding healthy tissue. Mathematically, the biological effective dose (BED) for a dose x ∈ R n delivered over n fractions is given by where ρ is the radiosensitivity parameter of the irradiated tissue. More specifically, it can be interpreted as the tissue's sensitivity to fractionation, where a low value indicates a high sensitivity to fractionation, i.e., the distribution of treatment over multiple fractions.
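To fix ideas, the sketch below evaluates the BED under the standard linear-quadratic form, in which each fraction with dose x i contributes x i + x i ²/ρ. Identifying ρ with the usual α/β ratio is an assumption made for this illustration, as are the function name and the dose values.

```python
def biological_effective_dose(doses, rho):
    """BED under the assumed linear-quadratic form: sum of x_i + x_i**2 / rho."""
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return sum(x + x * x / rho for x in doses)


if __name__ == "__main__":
    balanced = [5.0, 5.0]    # hypothetical two-fraction schedules (in gray)
    unbalanced = [8.0, 2.0]  # same total physical dose, unequal fractions
    for rho in (10.0, 2.0):
        # A lower rho widens the gap between the two schedules.
        print(rho,
              biological_effective_dose(balanced, rho),
              biological_effective_dose(unbalanced, rho))
```

Under this assumed form, tissue with a low ρ reacts more strongly to how the same total dose is divided over the fractions, in line with the interpretation of ρ as sensitivity to fractionation.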
While there is an extensive body of research on the value of ρ for different tumor sites, it remains subject to significant uncertainty (Joiner and Van der Kogel, 2016). Moreover, since this value can differ from patient to patient, there is a very limited amount of data available and there is little evidence to suggest it follows some well known distribution. Throughout the rest of the example, we denote the sensitivity to fractionation by ρ 1 and ρ 2 for the tumor and the surrounding healthy tissue, respectively.
For illustrative purposes, we consider a setting in which it has been decided to deliver the treatment over two fractions, i.e., the optimization variables are limited to the dose in the first and second fraction. Moreover, we focus on the uncertainty of ρ 2 , and thus model the restriction of sparing the healthy tissue through an ambiguous chance constraint. Mathematically, we wish to solve the following optimization problem (Ten Eikelder et al., 2019): where σ is the generalized dose-sparing factor that denotes the fraction of the mean tumor dose that the healthy tissue receives on average, x min is the minimum dose that must be delivered in each fraction, and t(ρ 2 ) denotes the tolerance level of the healthy tissue and is given by In other words, the healthy tissue is known to tolerate a total dose of D gray if it is delivered in T fractions under dose shape factor φ. This dose shape factor is a parameter that characterizes the spatial heterogeneity of a dose distribution (Perkó et al., 2018).
The ambiguity of ρ is modeled through the mean-MAD ambiguity set, where the lower bound of the support is given by a instead of 0. The ambiguous chance constraint (47b) is not naturally stated in a form that Theorem 10 or 11 can be applied to. It can be rewritten, however, as where we note multiplication by ρ is allowed as its support is nonnegative. Leveraging the tail probability bound, we find for ε ∈ (0, (µ − a)/(b − a)) that (48) is equivalent to We solve (47) for a specific, realistic set of parameters taken from Ten Eikelder et al. (2019), which are reported in Table 2. Figure 11 shows the feasible region and optimal solution of (47) for different values of ε as well as the feasible region when we assume having the exact knowledge that ρ = µ. Remarkable in this example is the similarity between the feasible region of the problem without uncertainty and that of the ambiguous problem for ε = 0.1 and ε = 0.05.
From the feasible region for ε = 0.01, however, it is clear that requiring a low risk of violation results in a solution that is much worse in terms of tumor BED. It does, on the other hand, illustrate how the shape of the feasible region changes with ε: the feasibility of unbalanced solutions, i.e., solutions that administer a different dose in the two fractions, is impacted much more severely than that of balanced solutions.

Conclusion and outlook
Tail probabilities are ubiquitous in probabilistic studies in many areas of science and application domains. As with the original Chebyshev inequality for mean-variance ambiguity, we expect our novel tail bounds for mean-MAD ambiguity to find many applications.
In our search for tight bounds under limited information, we had to solve for the worst-case distribution and worst-case value of the expectation of the indicator function 1{X ≥ t}. In this paper the limited information was captured by the ambiguity sets P (µ,b,d) and P (µ,b,d,β) , and it turned out that the combination of the non-convex indicator function with these ambiguity sets gave rise to semi-infinite linear programs with easy, closed-form solutions.
In future work, we expect to find more such solvable classes, i.e., specific combinations of objective function (other than the indicator function) and ambiguity sets that together give rise to solvable linear programs and hence easy extremal distributions. In this way, one could try to sharpen the tail bounds by including more information (e.g., higher moments or percentiles), or to consider objective functions other than the tail probability. Our proof method based on solving the dual problem with piecewise-linear majorants is not tailor-made for the indicator function, and could potentially work for a much larger class of (measurable) objective functions.
Another direction we shall pursue is the application of the bounds to more complex, and possibly high-dimensional, robust optimization problems. To do so, we shall leverage the connection with the quickly evolving field of DRO, as illustrated by the examples in Section 4. Indeed, minmax and maxmin decision problems arise naturally, and the bounds and proof techniques can help in advancing that field.

A Proofs of tail bounds
Proof Theorem 3. We will show that additional information on a particular instance of the tail distribution (e.g., β = P(X ≥ µ)) results in tighter bounds. We again consider the Borel measurable function 1 {x≥t} . Under P (µ,b,d,β) ambiguity of the random variable X we now need to solve which is a semi-infinite linear program with four equality constraints.
The dual problem has four variables, and therefore the tightest majorant touches 1{x ≥ t} at four or fewer points. Since F (x) is piecewise linear with a jump discontinuity there are four candidate scenarios, which are described in Figure 12. x and objective value Solving the primal problem (49) with probability masses on the points {0, t, µ, b} gives Since primal and dual feasible solutions have the same objective value we have strong duality and hence found the optimal solutions.
The proof is then completed by looking which scenario prevails on a specific interval, and these intervals can be determined by simply equating the minimum objective values and thereafter solving for t to find τ 1 and τ 2 for, respectively, scenario 1 and scenario 2.
Proof Theorem 4. We will use similar arguments as in the proof of Theorem 3, but now we consider a dual problem where we are maximizing a minorizing function. Under P (µ,b,d,β) ambiguity of the random variable X we now need to solve which is a semi-infinite linear program with four equality constraints. The dual problem is given by sup λ0,λ1,λ2,λ3 Note that F (x) has both a 'kink' and a jump discontinuity at x = µ. Additionally, we are to construct the tightest minorant this time. The dual problem has four variables, and hence the tightest minorant touches 1 {x>t} at four or fewer points. Since F (x) is piecewise linear with a jump discontinuity there are four candidate solutions, which are depicted in Figure 13. Scenario 1a implies F (t) = 0, F (µ) = F (b) = 1, which gives the dual solution and objective value Solving the primal problem (58) with probability masses on the points {t, µ, b} gives Scenario 2a implies that F (0) = F (t) = 0 and F (b) = 1, which results in with objective value Indeed, solving the primal problem with probability masses on {0, t, b} gives with objective value 0. All probability mass is placed on points that are less than or equal to t. Hence, the optimal primal objective value is also equal to 0.
Finally, the proof is completed by inspecting which scenario prevails on a specific interval. These intervals are determined by equating the maximum objective values and solving these equations with respect to t to find τ 1 and τ 2 for scenario 1 and 2, respectively.
B Comparison with tight bounds for (µ, b, σ) ambiguity
Next to the comparison with Cantelli's inequality in Section 2.2, we also look at the tight bounds for the (µ, b, σ)-ambiguity set. De Schepper and Heijnen (1995) provide expressions for these bounds. The upper bound is given by The lower bound equals inf Comparing these bounds with their MAD equivalents is again not straightforward, and we will be using a similar numerical example to compare our results with those of De Schepper and Heijnen (1995).
We use the following parameter setting: a = 0, µ = 1, b = 2, d = 1/4. Furthermore, we consider three values for σ: σ = d = 1/4, σ = 1/3, and σ = √(db/2) = 1/2.
The following problem is the mean-MAD counterpart of the mean-variance-semivariance model discussed in the work of Natarajan and Zhou (2007): where β adopts the role of the semivariance to model skewness information. Instead of solving this problem directly, we apply the robust Chebyshev bounds with skewness information to the first-order condition of the newsvendor problem. Thus, we will use the results from Theorem 3 and Theorem 4 to bound the tail distribution of the demand D. Tight lower and upper bounds for the optimal order quantity q * follow from inf P∈P (µ,b,d,β) P(D > q) and sup P∈P (µ,b,d,β) P(D > q), respectively. The following result provides an interval that contains the optimal order quantity q * .
Theorem 12 (Order quantity bounds under mean-MAD-β ambiguity). Suppose the newsvendor knows the mean µ, the mean absolute deviation d, the probability P(D ≥ µ) = β and the upper bound b of the demand distribution P(D ≤ q). The optimal order quantity q * that solves max q E P [π(q, D)] is then This result provides a robust policy that models the uncertainty captured by the ambiguity set P (µ,b,d,β) . Obviously, ordering the mean is optimal if η = 1 − β. The lower bound q L relates to the worst-case demand distribution. Similar to the mean-MAD-range case, q L is larger than µ when the profit margin η exceeds 1 − d/(2(b − µ)). Hence, an interesting observation is that the skewness information does not influence the point at which we order more than the mean. This can be contrasted with the results of Natarajan and Zhou (2007). These authors show for P (µ,σ,s) ambiguity that the order quantity is greater than µ if η > (1 + s)/2, where s is the normalized semivariance. Table 3 shows that the bounded support [0, b] again influences the intervals for low and high profit margins. The intervals in this section, however, are sharper than the ones found in Table 1, which is a consequence of the additional information regarding the skewness of the demand distribution.

D Upper bound for retention function
Proposition 2. The worst-case expected claim payment of the direct insurer as a function of the retention limit $z$ is given by where the values of $\tau_1$ and $\tau_2$ are given by

Proof. First, note that $\psi(z, S) = \min\{S, z\} = S - \max\{S - z, 0\}$, and hence our problem boils down to solving $\sup_{P \in \mathcal{P}_{(\mu,b,d)}} E_P[S - \max\{S - z, 0\}]$. The second term is convex in the uncertain parameter and therefore we can apply the lower bounds discussed by Postek et al. (2018), that is, we solve the optimization problem $\inf_{P \in \mathcal{P}_{(\mu,b,d)}}$ which is convex and piecewise linear in the optimization variable $\theta$. Hence, one can find the optimal solution value that depends on the retention limit $z$. Solving problem (73) and subtracting the optimal value from $\mu$ results in the four cases mentioned in (71).

E Proof of Theorem 6 (EC)
We shall now solve (30).
Let us make some observations about the functions $G(t) := \inf_{P \in \mathcal{P}_{(\mu,d)}} P(B \ge t)$ and $F(r) := rG(r)$.
The function $G(t)$ starts in $G(0) = (2\mu - d)/(2\mu)$, decays until $\tau_1$, remains flat for $t \in [\tau_1, \mu]$, decays until reaching zero at $t = \tau_2$, and then remains zero. The function $F(r)$ starts in 0, is concave until $\tau_1$, increases linearly for $r \in [\tau_1, \mu]$, and then remains concave until reaching zero. This implies that the maximum of $F(r)$ is the maximum of the first concave part, the point $F(\mu)$, or the maximum of the second concave part.
The first concave part is given by the function for which $F_1'(r) = 0$ gives . This value should be compared with $F(\mu) = \frac{d\mu}{2(b-\mu)}$, and in fact solving for $d$ for which $F_1(r_1) = F(\mu)$ gives The second concave part is given by the function for which $F_2'(r) = 0$ gives Solving for $d$ for which $F_2(r_2) = F(\mu)$ gives Upon reflection, $d_2$ must be the point where the right-derivative of $F_2(r)$ turns positive, which indeed is the case. It can be shown that $d_2 \le d_1$ for $\mu \in [0, b/5]$ and $d_2 \ge d_1$ for $\mu \in [b/5, b]$; see Figure 16.
First consider the case $d_1 \le d_2 \le d_{\max}$, hence assuming $b \le 5\mu$. For $d \in [0, d_1]$, the maximum of $F(r)$ is located at $r_1$. For $d \in [d_1, d_2]$, the maximum of $F(r)$ is at $r = \mu$, because $F(\mu) \ge F(r_1)$ and the function $F(r)$ will not increase for $r \ge \mu$. For $d \in [d_2, d_{\max}]$, the maximum of $F(r)$ is located at $r_2$, because $F(\mu) \ge F(r_1)$ and the function $F(r)$ still increases after $r = \mu$ until $r = r_2$. Figure 6 illustrates these three scenarios by plotting $F(r)$ for various values of $d$.
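The three regimes can be checked numerically. The sketch below is an illustration under assumptions, not code from the paper: the closed form used for $G(t)$ is reconstructed from the shape described in this proof (start value $(2\mu - d)/(2\mu)$, plateau $d/(2(b-\mu))$, zero from $\tau_2$ onward) with lower support bound $a = 0$, and the expressions for `d1`, `d2` and `d_max` are re-derived from that form rather than copied from the displayed equations. All function names are hypothetical.

```python
import math


def tail_lower_bound(t, mu, b, d):
    """Assumed form of G(t) = inf P(B >= t) on support [0, b] with mean mu and MAD d."""
    plateau = d / (2 * (b - mu))           # value on the flat part [tau_1, mu]
    if t < mu:
        return max(1 - d / (2 * (mu - t)), plateau)
    tau2 = mu + d * mu / (2 * mu - d)      # point from which the bound is zero
    if t >= tau2:
        return 0.0
    return (d * t - 2 * mu * (t - mu)) / (2 * mu * (b - t))


def thresholds(mu, b):
    """MAD thresholds d1, d2 and the largest feasible MAD (re-derived, see lead-in)."""
    d1 = 2 * mu * (b - mu) / (math.sqrt(mu) + math.sqrt(b - mu)) ** 2
    d2 = 2 * mu * (b - mu) / (2 * b - mu)
    d_max = 2 * mu * (b - mu) / b
    return d1, d2, d_max


def argmax_F(mu, b, d, n=2000):
    """Grid maximizer of F(r) = r * G(r) over [0, b]."""
    grid = [b * k / n for k in range(n + 1)]
    return max(grid, key=lambda r: r * tail_lower_bound(r, mu, b, d))


mu, b = 1.0, 2.0                           # satisfies b <= 5 * mu
d1, d2, d_max = thresholds(mu, b)
print("d1, d2, d_max:", d1, d2, d_max)
for d in (0.25, 0.6, 0.8):                 # one value per regime
    print("d =", d, "maximizer of F:", argmax_F(mu, b, d))
```

For $\mu = 1$ and $b = 2$ the re-derived thresholds are $d_1 = 1/2$ and $d_2 = 2/3$. The grid maximizer lies below $\mu$ for $d = 0.25$, at $\mu$ for $d = 0.6$, and above $\mu$ for $d = 0.8$, and the two thresholds coincide when $b = 5\mu$.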
Then consider the case $d_2 \le d_1 \le d_{\max}$, hence assuming $b \ge 5\mu$. Now $r = \mu$ is no longer a candidate optimizer, because $F(r)$, viewed as a function of $d$, becomes increasing at $r = \mu$ before $F(\mu)$ beats $F(r_1)$. Therefore, the maximum is attained at either $r_1$ or $r_2$: at $r_1$ when $F(r_1) > F(r_2)$, and at $r_2$ otherwise.
Solving for d for which F (r 1 ) = F (r 2 ) is computationally tractable, but does not lead to a closed-form solution. See Figure 17 for an example of this case.
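As a rough illustration of how this crossover value of $d$ could be computed, the sketch below bisects on the sign of the difference between the best value of $F$ on $[0, \mu]$ and the best value on $[\mu, b]$. It rests on assumptions: the closed form of $G(t)$ is reconstructed from the description in this proof with lower support bound $a = 0$, the maxima are approximated on a grid, and the function names are hypothetical. It is meant for the case $b \ge 5\mu$ only.

```python
def tail_lower_bound(t, mu, b, d):
    """Assumed form of G(t) = inf P(B >= t) on support [0, b] with mean mu and MAD d."""
    if t < mu:
        return max(1 - d / (2 * (mu - t)), d / (2 * (b - mu)))
    tau2 = mu + d * mu / (2 * mu - d)      # point from which the bound is zero
    if t >= tau2:
        return 0.0
    return (d * t - 2 * mu * (t - mu)) / (2 * mu * (b - t))


def best_on_interval(lo, hi, mu, b, d, n=2000):
    """Grid approximation of the maximum of F(r) = r * G(r) over [lo, hi]."""
    best = 0.0
    for k in range(n + 1):
        r = lo + (hi - lo) * k / n
        best = max(best, r * tail_lower_bound(r, mu, b, d))
    return best


def gap(mu, b, d):
    """Positive when the left part of F (up to mu) beats the right part (from mu on)."""
    return best_on_interval(0.0, mu, mu, b, d) - best_on_interval(mu, b, mu, b, d)


def crossover_mad(mu, b, tol=1e-8):
    """Bisection for the MAD value at which the maximizer jumps from r_1 to r_2."""
    lo = 1e-9                              # small d: maximum on the left part
    hi = 2 * mu * (b - mu) / b - 1e-9      # just below the largest feasible MAD
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mu, b, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


d_star = crossover_mad(1.0, 10.0)          # example with b >= 5 * mu
print("crossover MAD:", d_star)
```

Bisection is used here only because the sign of the gap changes once on the feasible range of $d$ in this example; any one-dimensional root finder would serve equally well.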

F Proofs of distribution-free stop-loss bounds
Proof of Theorem 7. We will show via primal-dual reasoning that the stated stop-loss formulas are tight upper bounds. We now consider the measurable function φ(z, s). Under P (µ,b,d) ambiguity of the random which is a semi-infinite linear program with three equality constraints.
Consider the dual of (74), $\inf_{\lambda_0,\lambda_1,\lambda_2}$ Define $F(s) := \lambda_0 + \lambda_1 s + \lambda_2 |s - \mu|$. Then the inequality in (75) can be written as $\varphi(z, s) \le F(s)$, $\forall s$, i.e. $F(s)$ majorizes the 'staircase' function $\varphi(z, s)$. Note that $F(s)$ has a 'kink' at $s = \mu$. The dual problem has three variables, and therefore there exists a majorant that touches $\varphi(z, s)$ at three or fewer points. Since $F(s)$ is piecewise linear with a 'wedge' shape there are six candidate scenarios, which are displayed in Figure 19. When $m + z \le \mu$, $F(s) = m$ and touches $\varphi(z, s)$ in $[m + z, b]$ (scenario 1a), or Scenario 1a implies that $F(0) = F(m + z) = F(\mu) = F(b) = m$, and hence $\lambda_0 = m$, $\lambda_1 = \lambda_2 = 0$ with objective value $m$. It is clear that the optimal primal objective value is also equal to $m$ as the primal solution can only assign probability to values greater than or equal to $m + z$ (which is a consequence of complementary slackness).
Solving the primal problem (74) with probability masses on the points $\{0, m + z, b\}$ gives $\int_s \varphi(z, s)\,dP(s) = m\,\frac{2\mu + bd/(\mu - b)}{2(m + z)}$. Since primal and dual feasible solutions have the same objective value, strong duality holds and we have found the optimal solutions. Scenario 2a implies $F(0) = 0$, $F(\mu) = \mu - z$, $F(m + z) = m$ which gives with objective value Solving the optimal probabilities for the primal problem (74) Solving for the corresponding optimal probabilities of (74) confirms that $\int_s \varphi(z, s)\,dP(s) = p_{(m+z)} m + p_b m = m\left(1 - \frac{d}{2\mu}\right)$.
The proof of the first part of the theorem is then completed by taking the minimum for each scenario.
The second part is an immediate consequence of upper bound (8) in Postek et al. (2018), which is a result that was already shown by Ben-Tal and Hochman (1972).
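For readers who want to experiment with the cited bound, the following sketch evaluates the classical three-point upper bound of Ben-Tal and Hochman (1972) on $E[f(X)]$ for a convex $f$, given support $[a, b]$, mean $\mu$ and MAD $d$. The function names are hypothetical, and the payoff used in the example is a plain stop-loss payoff $\max\{s - z, 0\}$ chosen for illustration; it is not claimed to be the payoff $\varphi$ of Theorem 7.

```python
def three_point_distribution(a, mu, b, d):
    """Extremal distribution on {a, mu, b} with mean mu and mean absolute deviation d."""
    p_a = d / (2 * (mu - a))
    p_b = d / (2 * (b - mu))
    p_mu = 1 - p_a - p_b
    if p_mu < 0:
        raise ValueError("d exceeds the largest MAD compatible with a, mu, b")
    return [(a, p_a), (mu, p_mu), (b, p_b)]


def worst_case_expectation(f, a, mu, b, d):
    """Upper bound on E[f(X)] for convex f over all distributions in the ambiguity set."""
    return sum(p * f(x) for x, p in three_point_distribution(a, mu, b, d))


a, mu, b, d = 0.0, 1.0, 2.0, 0.5
z = 1.5                                        # illustrative retention level


def stop_loss(s):
    """Convex stop-loss payoff with retention z (illustrative choice)."""
    return max(s - z, 0.0)


dist = three_point_distribution(a, mu, b, d)
mean = sum(p * x for x, p in dist)             # should reproduce mu
mad = sum(p * abs(x - mu) for x, p in dist)    # should reproduce d
bound = worst_case_expectation(stop_loss, a, mu, b, d)
print("mean:", mean, "MAD:", mad, "stop-loss bound:", bound)
```

The check on `mean` and `mad` confirms that the three-point distribution is a member of the ambiguity set, so the bound is attained and therefore tight for convex payoffs.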
Proof of Theorem 9. We consider the following ambiguity set for the distribution of $S$: We can deduce from $\mathcal{F} \subseteq \mathcal{G}$ and Theorem 7 that We will now solve the maximization problems over $d$ explicitly. For the instance with $m + z \le \bar{\mu}$ the problem is easily solved by recognizing that the distribution with probability mass 1 on $\bar{\mu}$ is a member of $\mathcal{G}$, and therefore the maximum $m$ will be paid almost surely.
For the second case, note that we take the minimum over two linear functions of $d$, an increasing and a decreasing one. Therefore, the global maximum is at the intersection of these functions, and the optimal $d$ is thus either $\bar{d}$ or $d^*$, where $d^*$ is such that Finally, the third case can be solved in the exact same way, resulting in the established theorem.
We can thus combine both cases, such that (77) is equivalent to where it should be noted that this implies $h - \bar{a}x > 0$, which is therefore redundant.
Theorem 13. Let $g : \mathbb{R}^n \to \mathbb{R}$, $h \in \mathbb{R}$ and let $Z$ be an $m$-dimensional random variable whose distribution lies in the ambiguity set for some $d \in \mathbb{R}^m$ such that $d_i \in [0, 1]$ for all $i$. Let $\epsilon \in (0, \frac{1}{2})$ and $I$ be the set of indices $i$ such that $\frac{d_i}{2\epsilon} \le 1$. For any $y \in \mathbb{R}^n$ it holds that which is a convex set of constraints if all $g_i$ are convex functions.
Proof. Using the pairwise independence of $Z$ we find From this, it readily follows that it must at least hold that $\sup_{P \in \mathcal{P}} P[Z_i > -g_i(x)] \le \epsilon$ for all $i$. From Theorem 1 we consequently know that it must hold that $-g_i(x) > E[Z_i] = 0$ for all $i$, as we know that $\epsilon < \frac{1}{2}$. Given that $-g_i(x) > 0$, we know that Since $1 - \frac{d_i}{2} \ge \frac{1}{2}$ it follows from $\epsilon < \frac{1}{2}$ that it must hold for any feasible solution $x$ that Moreover, we note that $\frac{d_i}{-2g_i(x)} \le \epsilon$ is equivalent to $-g_i(x) \ge \frac{d_i}{2\epsilon}$ and thus if $\frac{d_i}{2\epsilon} \ge 1$, imposing this is overly restrictive, as we know from (81) that the worst-case probability of violation is 0, not $\epsilon$, in that situation. For all such $i$, we thus simply require that $g_i(x) + 1 \le 0$, such that $\sup_{P \in \mathcal{P}} P[Z_i > -g_i(x)] = 0$ $\forall i \in I$.
Given this analysis, we find that (80) is equal to , and thus (78)

Theorem 15. Let $g_i : \mathbb{R}^n \to \mathbb{R}$ be convex for $i = 1, \dots, m$ and let $Z$ be a 1-dimensional random variable whose distribution lies in the ambiguity set for some $d \in [0, 1]$. For any $\epsilon \in (0, \frac{1}{2})$ and $x \in \mathbb{R}^n$ it holds that if and only if

Proof. Using the fact that every constraint features the same uncertain parameter $Z$ we find We thus know that (82) is equivalent to to which we can apply Theorem 1 with $t = \min_i \{-g_i(x)\}$. Since we know $\epsilon < \frac{1}{2}$, we once again find that it must hold that $\min_i \{-g_i(x)\} > 0$ and thus by the same reasoning as in the proof of Theorem 14 we find that it must hold that