
Superquantiles at Work: Machine Learning Applications and Efficient Subgradient Computation

  • SI: Optimization, Convex and Variational Analysis
  • Published in: Set-Valued and Variational Analysis

Abstract

R. Tyrell Rockafellar and his collaborators introduced, in a series of works, new regression modeling methods based on the notion of superquantile (or conditional value-at-risk). These methods have been influential in economics, finance, management science, and operations research in general. Recently, they have been subject of a renewed interest in machine learning, to address issues of distributional robustness and fair allocation. In this paper, we review some of these new applications of the superquantile, with references to recent developments. These applications involve nonsmooth superquantile-based objective functions that admit explicit subgradient calculations. To make these superquantile-based functions amenable to the gradient-based algorithms popular in machine learning, we show how to smooth them by infimal convolution and detail numerical procedures to compute the gradients of the smooth approximations. We put the approach into perspective by comparing it to other smoothing techniques and by illustrating it on toy examples.

References

  1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P.A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: Keeton, K., Roscoe , T. (eds.) 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283. USENIX Association. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (2016)

  2. Kairouz, P., et al.: Advances and open problems in federated learning. Found. Trends Mach. Learn. 14(1–2), 1–210 (2021). https://doi.org/10.1561/2200000083

  3. Beck, A., Teboulle, M.: Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2), 557–580 (2012). https://doi.org/10.1137/100818327

  4. Ben-Tal, A., Ghaoui, L. E., Nemirovski, A.: Robust Optimization, Princeton Series in Applied Mathematics, vol. 28. Princeton University Press, Princeton (2009). https://doi.org/10.1515/9781400831050

  5. Ben-Tal, A., Teboulle, M.: Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manage. Sci. 32, 1445–1466 (1986). https://doi.org/10.1287/mnsc.32.11.1445

  6. Ben-Tal, A., Teboulle, M.: An old-new concept of convex risk measures: The optimized certainty equivalent. Math. Finance 17(3), 449–476 (2007). https://doi.org/10.1111/j.1467-9965.2007.00311.x

  7. Chen, C., Mangasarian, O. L.: A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. 5(2), 97–138 (1996). https://doi.org/10.1007/BF00249052

  8. Cucker, F., Zhou, D. X.: Learning Theory: An Approximation Theory Viewpoint, vol. 24. Cambridge University Press, Cambridge (2007). https://doi.org/10.1017/CBO9780511618796

  9. Curi, S., Levy, K.Y., Jegelka, S., Krause, A.: Adaptive sampling for stochastic risk-averse learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/0b6ace9e8971cf36f1782aa982a708db-Abstract.html (2020)

  10. Dantzig, G. B.: Discrete-variable extremum problems. Oper. Res. 5(2), 266–288 (1957). https://doi.org/10.1287/opre.5.2.266

  11. Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. arXiv:1810.08750 (2018)

  12. Fan, Y., Lyu, S., Ying, Y., Hu, B.: Learning with average top-k loss. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 497–505. https://proceedings.neurips.cc/paper/2017/hash/6c524f9d5d7027454a783c841250ba71-Abstract.html (2017)

  13. Föllmer, H., Schied, A.: Convex measures of risk and trading constraints. Finance Stoch. 6(4), 429–447 (2002). https://doi.org/10.1007/s007800200072

  14. Guigues, V., Sagastizábal, C.A.: Risk-averse feasible policies for large-scale multistage stochastic linear programs. Math. Program. 138(1-2), 167–198 (2013). https://doi.org/10.1007/s10107-012-0592-1

  15. Hiriart-Urruty, J. B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms. Springer, Heidelberg (1993). Two volumes

  16. Ho-Nguyen, N., Wright, S. J.: Adversarial classification via distributional robustness with Wasserstein ambiguity. arXiv:2005.13815 (2020)

  17. Holstein, K., Vaughan, J. W., Daumé, H. III, Dudík, M., Wallach, H.M.: Improving fairness in machine learning systems: What do industry practitioners need? In: Brewster, S.A., Fitzpatrick, G., Cox, A.L., Kostakos, V. (eds.) Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019, p. 600. ACM. https://doi.org/10.1145/3290605.3300830 (2019)

  18. Howard, R. A., Matheson, J. E.: Risk-sensitive Markov decision processes. Manage. Sci. Theory 18, 356–369 (1972). https://doi.org/10.1287/mnsc.18.7.356

  19. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: Flach, P.A., Bie, T.D., Cristianini, N. (eds.) Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II, Lecture Notes in Computer Science, vol. 7524, pp. 35–50. Springer. https://doi.org/10.1007/978-3-642-33486-3_3 (2012)

  20. Kawaguchi, K., Lu, H.: Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In: Chiappa, S., Calandra, R. (eds.) The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], Proceedings of Machine Learning Research, vol. 108, pp. 669–679. PMLR. http://proceedings.mlr.press/v108/kawaguchi20a.html (2020)

  21. Knight, W.: A self-driving Uber has killed a pedestrian in Arizona. Ethical Tech (2018)

  22. Laguel, Y., Malick, J., Harchaoui, Z.: First-order optimization for superquantile-based supervised learning. In: 30th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2020, Espoo, Finland, September 21-24, 2020, pp. 1–6. IEEE. https://doi.org/10.1109/MLSP49062.2020.9231909 (2020)

  23. Laguel, Y., Pillutla, K., Malick, J., Harchaoui, Z.: A superquantile approach to federated learning with heterogeneous devices. In: 55th Annual Conference on Information Sciences and Systems, CISS 2021, Baltimore, MD, USA, March 24-26, 2021, pp. 1–6. IEEE. https://doi.org/10.1109/CISS50987.2021.9400318 (2021)

  24. Lee, J., Park, S., Shin, J.: Learning bounds for risk-sensitive learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/9f60ab2b55468f104055b16df8f69e81-Abstract.html (2020)

  25. Levy, D., Carmon, Y., Duchi, J.C., Sidford, A.: Large-scale methods for distributionally robust optimization. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/64986d86a17424eeac96b08a6d519059-Abstract.html (2020)

  26. Luna, J. P., Sagastizábal, C.A., Solodov, M. V.: An approximation scheme for a class of risk-averse stochastic equilibrium problems. Math. Program. 157(2), 451–481 (2016). https://doi.org/10.1007/s10107-016-0988-4

  27. Metz, R.: Microsoft’s neo-Nazi sexbot was a great lesson for makers of AI assistants. Artif. Intell. (2018)

  28. Mhammedi, Z., Guedj, B., Williamson, R.C.: Pac-bayesian bound for the conditional value at risk. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/d02e9bdc27a894e882fa0c9055c99722-Abstract.html (2020)

  29. Miranda, S. I.: Superquantile regression: theory, algorithms, and applications. Tech. rep., Naval Postgraduate School, Monterey, CA (2014)

  30. Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., Tanaka, T.: Nonparametric return distribution approximation for reinforcement learning. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 799–806. Omnipress. https://icml.cc/Conferences/2010/papers/652.pdf (2010)

  31. Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017). https://doi.org/10.1007/s10208-015-9296-2

  32. Nesterov, Y. E.: Introductory Lectures on Convex Optimization - A Basic Course, Applied Optimization, vol. 87. Springer, Berlin (2004). https://doi.org/10.1007/978-1-4419-8853-9

  33. Nesterov, Y. E.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5

  34. Nocedal, J., Wright, S. J.: Numerical Optimization. Springer, New York (2006)

  35. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf (2019)

  36. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., VanderPlas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://dl.acm.org/citation.cfm?id=2078195

  37. Pollard, D.: A User’s Guide to Measure Theoretic Probability, vol. 8. Cambridge University Press, Cambridge (2002). https://doi.org/10.1017/CBO9780511811555

  38. Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, vol. 97, pp. 5389–5400. PMLR. http://proceedings.mlr.press/v97/recht19a.html (2019)

  39. Rockafellar, R.T.: Solving stochastic programming problems with risk measures by progressive hedging. Set-Valued Var. Anal. 26(4), 759–768 (2018). https://doi.org/10.1007/s11228-017-0437-4

  40. Rockafellar, R.T., Royset, J.O.: Superquantiles and their applications to risk, random variables, and regression. In: Theory Driven by Influential Applications, pp. 151–167. INFORMS (2013)

  41. Rockafellar, R.T., Royset, J.O.: Random variables, monotone relations, and convex analysis. Math. Program. 148(1-2), 297–331 (2014). https://doi.org/10.1007/s10107-014-0801-1

  42. Rockafellar, R.T., Royset, J.O., Miranda, S.I.: Superquantile regression with applications to buffered reliability, uncertainty quantification, and conditional value-at-risk. Eur. J. Oper. Res. 234(1), 140–154 (2014). https://doi.org/10.1016/j.ejor.2013.10.046

  43. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)

  44. Rockafellar, R.T., Uryasev, S., et al.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)

  45. Rockafellar, R.T., Wets, R. J.B.: Variational Analysis, vol. 317. Springer Science & Business Media, Berlin (2009)

  46. Ruszczynski, A., Shapiro, A.: Optimization of convex risk functions. Math. Oper. Res. 31(3), 433–452 (2006). https://doi.org/10.1287/moor.1050.0186

  47. Sarykalin, S., Serraino, G., Uryasev, S.: Value-at-risk vs. conditional value-at-risk in risk management and optimization. In: State-of-the-art decision-making tools in the information-intensive age, pp. 270–294. Informs (2008)

  48. Shafieezadeh-Abadeh, S., Kuhn, D., Esfahani, P.M.: Regularization via mass transportation. J. Mach. Learn. Res. 20, 103:1–103:68 (2019). http://jmlr.org/papers/v20/17-633.html

  49. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, Cambridge (2014). http://www.cambridge.org/de/academic/subjects/computer-science/pattern-recognition-and-machine-learning/understanding-machine-learning-theory-algorithms

  50. Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming - Modeling and Theory. MOS-SIAM Series on Optimization, 2nd edn., vol. 16 . SIAM, Philadelphia (2014). http://bookstore.siam.org/mo16/

  51. Soma, T., Yoshida, Y.: Statistical learning with conditional value at risk. arXiv:2002.05826 (2020)

  52. Sutton, R.S., Barto, A.G.: Reinforcement Learning. An Introduction. MIT Press, Cambridge (2018)

  53. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Solla, S.A., Leen, T.K., Müller, K. (eds.) Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 1057–1063. The MIT Press. http://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation (1999)

  54. Tamar, A., Chow, Y., Ghavamzadeh, M., Mannor, S.: Policy gradient for coherent risk measures. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1468–1476. https://proceedings.neurips.cc/paper/2015/hash/024d7f84fff11dd7e8d9c510137a2381-Abstract.html (2015)

  55. Vershynin, R.: High-Dimensional Probability. An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018). https://doi.org/10.1017/9781108231596

  56. Wainwright, M. J.: High-Dimensional Statistics. A Non-Asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781108627771

  57. Williamson, R.C., Menon, A.K.: Fairness risk measures. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, vol. 97, pp. 6786–6797. PMLR. http://proceedings.mlr.press/v97/williamson19a.html (2019)


Acknowledgements

We acknowledge support from ANR-19-P3IA-0003 (MIAI - Grenoble Alpes), NSF DMS 2023166, DMS 1839371, CCF 2019844, the CIFAR program “Learning in Machines and Brains”, and faculty research awards.

Author information

Corresponding author

Correspondence to Yassine Laguel.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 1

In this appendix, we provide a complete proof of Theorem 1. For classical results in this spirit, we refer to the monograph [8]. For discussions of statistical aspects of such learning problems, we refer, e.g., to [11, 24, 28].

The key step in the proof of Theorem 1 is to show the uniform convergence

$$ S^{p}_{n}(w) \to S^{p}(w) \text{ almost surely, uniformly over } w \in W. $$
(38)

Indeed, once we have this, the result immediately follows as

$$ \begin{array}{@{}rcl@{}} 0 \le S^{p}(w_{n}^{\star}) - S^{p}(w^{\star}) &=& S^{p}(w_{n}^{\star}) - S^{p}_{n}(w_{n}^{\star}) + S^{p}_{n}(w_{n}^{\star}) - S^{p}_{n}(w^{\star}) + S^{p}_{n}(w^{\star}) - S^{p}(w^{\star})\\ &\le& 2 \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| \to 0, \end{array} $$

where we use \(S^{p}_{n}(w_{n}^{\star }) \le S^{p}_{n}(w^{\star })\) in the second inequality.

In order to prove (38), we use the variational expression of the superquantile (4). We define

$$ \bar{S}^{p}(w, \eta) = \eta + \frac{1}{1-p} \mathbb{E}_{(x, y)\sim P} [\max(\ell(y, \varphi(w, x))-\eta,0)] , $$

so that, using that the loss is bounded by B, we can write

$$ S^{p}(w) = \underset{\eta \in [0, B]}{\min} \bar{S}^{p}(w, \eta). $$

We define the analogous empirical version \(\bar {S}^{p}_{n}(w, \eta )\) so that \(S^{p}_{n}(w) = \min \limits _{\eta \in [0, B]} \bar {S}^{p}_{n}(w, \eta )\). Note that \(\bar {S}^{p}_{n}(w, \eta )\) is measurable for each fixed (w, η) and \(S^{p}_{n}(w)\) is measurable for each fixed w.
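For concreteness, writing (x1, y1), …, (xn, yn) for the i.i.d. sample drawn from P, this empirical version reads

$$ \bar{S}^{p}_{n}(w, \eta) = \eta + \frac{1}{n(1-p)} \sum\limits_{i=1}^{n} \max\big(\ell(y_{i}, \varphi(w, x_{i}))-\eta, 0\big) . $$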

Claim 1

Under Assumption 2, the random variable

$$ \delta_{n}(w, \eta) := \bar{S}^{p}_{n}(w, \eta) - \bar{S}^{p}(w, \eta) $$

has mean zero, lies almost surely in [−B, B], and satisfies

$$ |\delta_{n}(w, \eta) - \delta_{n}(w^{\prime}, \eta^{\prime})| \le 2M/(1-p) \text{dist}_{\varphi}(w, w^{\prime}) + 2(1+1/(1-p)) |\eta - \eta^{\prime}| . $$
(39)

Note first that \(\mathbb {E}[\bar {S}^{p}_{n}(w, \eta )] = \bar {S}^{p}(w, \eta )\) and that the boundedness of δn comes from the boundedness of the loss function. The Lipschitz property of δn is likewise inherited from the loss function, as follows. Using that \(\max \limits \{\cdot , 0\}\) is 1-Lipschitz and that the loss is M-Lipschitz, we get

$$ \begin{array}{@{}rcl@{}} |\max\{\ell(y, \varphi(w, x)) - \eta, 0\} &-& \max\{\ell(y, \varphi(w^{\prime}, x)) - \eta^{\prime}, 0\}| \\ &&\le |\ell(y, \varphi(w, x)) - \ell(y, \varphi(w^{\prime}, x))| + |\eta - \eta^{\prime}| \\ &&\le M \|\varphi(w, x) - \varphi(w^{\prime}, x)\| + |\eta - \eta^{\prime}| \\ &&\le M\text{dist}_{\varphi}(w, w^{\prime}) + |\eta - \eta^{\prime}| . \end{array} $$

Inequality (39) then simply follows from the triangle inequality, and Claim 1 is proved.

The next step in the proof is, for a given ε > 0

  • to construct a cover T of W × [0, B], and then

  • to control the convergence over the points of T, more precisely to control the probability of the event

    $$ E_{n}(\varepsilon) = \underset{(w, \eta) \in T}{\bigcap} \left\{ \delta_{n}(w, \eta) \le \varepsilon / 2 \right\} . $$

First, using Assumption 1, we consider T1, an (ε(1 − p)/(8M))-cover of W with respect to distφ. We also consider T2, a uniform discretization of the line segment [0, B] of width ε/(8(1 + 1/(1 − p))). We can then introduce the cover of W × [0, B]

$$ T = T_{1} \times T_{2} \subset W \times [0, B]. $$

Since |T2| = 8B(1 + 1/(1 − p))/ε, we have that |T| = (8B(1 + 1/(1 − p))/ε)N(ε(1 − p)/(8M)). Note that the event \(\left \{ \delta _{n}(w, \eta ) \le \varepsilon / 2 \right \}\) is measurable for fixed (w, η) since δn(w, η) is measurable; therefore, En(ε) is measurable as a finite intersection of measurable events.

To get uniform convergence, it is sufficient to control what happens at the points of T. Indeed, for any (w, η), there exists a point \((w^{\prime }, \eta ^{\prime }) \in T\) such that \(\text {dist}_{\varphi }(w, w^{\prime }) \le \varepsilon (1-p)/(8M)\) and \(|\eta - \eta ^{\prime }| \le \varepsilon /(8(1+1/(1-p)))\). As a consequence, if the event En(ε) holds, then

$$ \begin{array}{@{}rcl@{}} \delta_{n}(w, \eta) &\le& \delta_{n}(w^{\prime}, \eta^{\prime}) + |\delta_{n}(w, \eta) - \delta_{n}(w^{\prime}, \eta^{\prime})| \\ &\overset{(39)}{\le}& \delta_{n}(w^{\prime}, \eta^{\prime}) + 2M/(1-p) \text{dist}_{\varphi}(w, w^{\prime}) + 2(1+1/(1-p)) | \eta - \eta^{\prime}|\\ &\le& \frac{\varepsilon}{2} + \frac{\varepsilon}{4} + \frac{\varepsilon}{4} = \varepsilon. \end{array} $$

This implies that the events of interest are included in \(\overline {E}_{n}(\varepsilon )\), the complement of En(ε); indeed, we have

$$ \left\{\underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right\} \subset \left\{\underset{(w,\eta) \in W \times [0, B]}{\sup} \delta_{n}(w, \eta) >\varepsilon\right\} \subset \overline E_{n}(\varepsilon) . $$

Postponing the proof of measurability of these events to Claim 3 at the end of this proof, we have the following bound on the sum of probabilities

$$ \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right) \le \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \overline E_{n}(\varepsilon)\right). $$
(40)

Claim 2

The probabilities of the complements of En(ε) are summable, i.e.,

$$ \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \overline E_{n}(\varepsilon)\right) < \infty . $$

This is a direct application of Hoeffding's inequality (see, e.g., [55, Theorem 2.2.2]), as follows. For any fixed (w, η) ∈ W × [0, B], Hoeffding's inequality gives

$$ \mathbb{P}(|\delta_{n}(w, \eta)| > \varepsilon/2) \le 2 \exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) . $$

Applied to all (w, η) ∈ T, together with a union bound over the |T| points of the cover, this yields

$$ \mathbb{P}\big(\overline E_{n}(\varepsilon)\big) \le 2|T|\exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) = \frac{16B(1+1/(1-p))}{\varepsilon} N\left( \frac{\varepsilon(1-p)}{8M}\right) \exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) . $$

Since the right-hand side is summable in n, this proves Claim 2.

We conclude on the uniform convergence (38) with the Borel-Cantelli lemma, by the classical rationale (see, e.g., the textbook [37, Chap. 2, Sec. 6]): the bound (40) and Claim 2 give that the probabilities are summable for any ε > 0; applying Borel-Cantelli along the sequence εk = 1/k then gives the uniform convergence (38), which completes the proof of the theorem.

Finally, it remains to show measurability of some events of interest.

Claim 3

The following events are measurable for each ε > 0:

$$ \begin{array}{@{}rcl@{}} E^{\prime}_{n}(\varepsilon) &:=& \left\{\underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right\} , \\ E^{\prime\prime}_{n}(\varepsilon) &:=& \left\{\underset{(w,\eta) \in W \times [0, B]}{\sup} \delta_{n}(w, \eta) >\varepsilon\right\} . \end{array} $$

We prove the claim for \(E^{\prime }_{n}(\varepsilon )\); the proof for \(E^{\prime \prime }_{n}(\varepsilon )\) is entirely analogous. Since the set \(\mathbb {Q}^{d}\) of d-dimensional rationals is dense in \(\mathbb {R}^{d}\) and the map \(w \mapsto |S^{p}_{n}(w) - S^{p}(w)|\) is continuous, we have that

$$ \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| = \underset{w \in W \cap \mathbb{Q}^{d}}{\sup} |S^{p}_{n}(w) - S^{p}(w)| . $$

Since the latter term is a supremum over a countable set of measurable random variables, we get that \(E^{\prime }_{n}(\varepsilon )\) is measurable.

Appendix B: Numerical Illustrations

We provide simple illustrations of the interest of using superquantiles in machine learning. More precisely, we reproduce the experimental framework of [9] and solve the superquantile optimization problems with the approach presented here, combining smoothing and quasi-Newton updates. For additional experiments with other datasets, metrics, and contexts, we refer to [9].

We consider two basic machine learning tasks (regression and classification) with linear prediction functions φ(w, x) = w⊤x and two standard datasets from the UCI Machine Learning repository. Denoting such a dataset by Pn = {(xi, yi)}1 ≤ i ≤ n, we introduce the (regularized) empirical risk minimization problem

$$ \underset{w \in \mathbb{R}^{d}}{\min} ~~\mathbb{E}_{(x,y)\sim P_{n}}\left[\ell(y, {w}^{\top} x)\right] +\frac{1}{2n}\|w\|^{2} , $$

and its smoothed superquantile analogue

$$ \underset{w \in \mathbb{R}^{d}}{\min} ~~{[\mathbb{S}^{\nu}_{p}]}_{(x,y)\sim P_{n}}\left[\ell(y, {w}^{\top} x) \right] +\frac{1}{2n}\|w\|^{2}. $$

We solve these problems using L-BFGS via the toolbox SPQR [22], which offers a simple user interface and implements the needed oracles (with the Euclidean smoothing of Example 5 for the smoothed approximation).
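To fix ideas, here is a minimal sketch of such a pipeline, independent of SPQR: a smoothed superquantile oracle obtained by maximizing q⊤u − (ν/2)‖q − 1/n‖² over the capped simplex {q ≥ 0, Σi qi = 1, qi ≤ 1/(n(1−p))}, combined with SciPy's L-BFGS-B on a least-squares objective. The penalty scaling, the helper names, and the synthetic data are illustrative assumptions and may differ from the exact constants of Example 5.

```python
import numpy as np
from scipy.optimize import minimize

def capped_simplex_projection(z, cap, n_iter=60):
    """Euclidean projection of z onto {q : 0 <= q_i <= cap, sum(q) = 1},
    by bisection on the shift lam in q_i = clip(z_i - lam, 0, cap)."""
    lo, hi = z.min() - cap, z.max()              # sum >= 1 at lo, sum <= 1 at hi
    for _ in range(n_iter):
        lam = 0.5 * (lo + hi)
        if np.clip(z - lam, 0.0, cap).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.clip(z - 0.5 * (lo + hi), 0.0, cap)

def smoothed_superquantile(losses, p, nu):
    """Value of the smoothed superquantile of `losses` and its gradient with
    respect to `losses` (the maximizing dual vector q, by Danskin's theorem)."""
    n = losses.size
    cap = 1.0 / (n * (1.0 - p))
    q = capped_simplex_projection(np.full(n, 1.0 / n) + losses / nu, cap)
    value = q @ losses - 0.5 * nu * np.sum((q - 1.0 / n) ** 2)
    return value, q

def objective(w, X, y, p, nu, reg):
    """Smoothed superquantile of the pointwise least-squares losses, plus the
    ridge term reg/2 * ||w||^2; returns (value, gradient)."""
    residuals = X @ w - y
    value, q = smoothed_superquantile(0.5 * residuals ** 2, p, nu)
    grad = X.T @ (q * residuals) + reg * w       # chain rule through the losses
    return value + 0.5 * reg * np.sum(w ** 2), grad

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
res = minimize(objective, np.zeros(5), args=(X, y, 0.98, 0.1, 1.0 / 200),
               jac=True, method="L-BFGS-B")
print(res.fun, res.x)
```

The projection onto the capped simplex is computed by a simple bisection on the shift parameter, which is enough for illustration purposes; the maximizing vector q directly provides the gradient of the smoothed superquantile with respect to the losses.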

Regression and Least-Squares

We consider a regularized least-squares regression on the Abalone dataset from the UCI Machine Learning repository. We perform an 80%/20% train-test split of the dataset and minimize the least-squares loss on the training set, both in expectation and with respect to the superquantile (with p = 0.98 and ν = 0.1).

We report in Fig. 7 the distribution of errors \(|y_{i} - {w}^{\top } x_{i}|\) on the testing dataset for both models w (standard in blue and superquantile in red). We observe that the superquantile model exhibits a thinner upper tail than the risk-neutral model, which is quantified by the leftward shift of the 0.98 quantile. This comes at the price of a lower performance in expectation than the model trained on the expected loss, which is clearly visible in the figure and quantified by the rightward shift of the mean.

Fig. 7

Regression: histogram of the regression errors on the testing dataset for the model learned by the superquantile approach (red), compared to the one learned by classical empirical risk minimization (violet). We see a reshaping of the histogram of errors and a gain on worst-case errors

Classification and Logistic Regression

We consider a logistic regression on the Australian Credit dataset. We randomly split the dataset with an 80%/20% train-test split for 5 different seeds. For each seed, we perform a pessimistic distributional shift on the training dataset by downsampling the majority class (similarly to what is done in [9, Sec. 5.2]). More precisely, we remove a large, randomly selected fraction of the majority class, so that it afterwards amounts to only 10% of the size of the minority class. We then tune the safety-level parameter p by k-fold cross-validation on the shifted dataset and select the value yielding the best validation accuracy; the grid used for this parameter is [0.8, 0.85, 0.9, 0.95, 0.99] (see the sketch below). We finally compute the testing accuracy and the testing precision with the selected parameter.
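For illustration, here is a minimal sketch of the downsampling step and of the grid over p; the helper name and its interface are hypothetical and not those of [9] or SPQR.

```python
import numpy as np

def pessimistic_shift(X, y, rng, ratio=0.10):
    """Downsample the majority class of a binary dataset (X, y) so that it
    amounts to `ratio` times the size of the minority class."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    maj_idx = np.flatnonzero(y == majority)
    n_keep = max(1, int(ratio * np.sum(y == minority)))
    keep = np.concatenate([np.flatnonzero(y == minority),
                           rng.choice(maj_idx, size=n_keep, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

p_grid = [0.8, 0.85, 0.9, 0.95, 0.99]
# For each seed: shift the training set with pessimistic_shift, run k-fold
# cross-validation over p_grid on the shifted set, keep the p with the best
# validation accuracy, then report test accuracy and precision.
```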

We report in the table of Fig. 8 the testing accuracy and the testing precision averaged over the 5 different seeds, together with the associated standard deviations. We observe that the superquantile model brings better performance than the standard model, both in terms of accuracy and precision.

Fig. 8

Classification: better testing accuracy and precision for the superquantile approach, in the case of distributional shifts

About this article

Cite this article

Laguel, Y., Pillutla, K., Malick, J. et al. Superquantiles at Work: Machine Learning Applications and Efficient Subgradient Computation. Set-Valued Var. Anal 29, 967–996 (2021). https://doi.org/10.1007/s11228-021-00609-w
