
Superquantiles at Work: Machine Learning Applications and Efficient Subgradient Computation

  • SI: Optimization, Convex and Variational Analysis
  • Published in: Set-Valued and Variational Analysis

Abstract

R. Tyrell Rockafellar and his collaborators introduced, in a series of works, new regression modeling methods based on the notion of superquantile (or conditional value-at-risk). These methods have been influential in economics, finance, management science, and operations research in general. Recently, they have been subject of a renewed interest in machine learning, to address issues of distributional robustness and fair allocation. In this paper, we review some of these new applications of the superquantile, with references to recent developments. These applications involve nonsmooth superquantile-based objective functions that admit explicit subgradient calculations. To make these superquantile-based functions amenable to the gradient-based algorithms popular in machine learning, we show how to smooth them by infimal convolution and detail numerical procedures to compute the gradients of the smooth approximations. We put the approach into perspective by comparing it to other smoothing techniques and by illustrating it on toy examples.

References

  1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P.A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: Keeton, K., Roscoe , T. (eds.) 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283. USENIX Association. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi (2016)

  2. Kairouz, P., et al.: Advances and open problems in federated learning. Found. Trends Mach. Learn. 14(1–2), 1–210 (2021). https://doi.org/10.1561/2200000083

  3. Beck, A., Teboulle, M.: Smoothing and first order methods: A unified framework. SIAM J. Optim. 22(2), 557–580 (2012). https://doi.org/10.1137/100818327

  4. Ben-Tal, A., Ghaoui, L. E., Nemirovski, A.: Robust Optimization, Princeton Series in Applied Mathematics, vol. 28. Princeton University Press, Princeton (2009). https://doi.org/10.1515/9781400831050

  5. Ben-Tal, A., Teboulle, M.: Expected utility, penalty functions, and duality in stochastic nonlinear programming. Manage. Sci. 32, 1445–1466 (1986). https://doi.org/10.1287/mnsc.32.11.1445

  6. Ben-Tal, A., Teboulle, M.: An old-new concept of convex risk measures: The optimized certainty equivalent. Math. Finance 17(3), 449–476 (2007). https://doi.org/10.1111/j.1467-9965.2007.00311.x

  7. Chen, C., Mangasarian, O. L.: A class of smoothing functions for nonlinear and mixed complementarity problems. Comput. Optim. Appl. 5(2), 97–138 (1996). https://doi.org/10.1007/BF00249052

  8. Cucker, F., Zhou, D. X.: Learning Theory: An Approximation Theory Viewpoint, vol. 24. Cambridge University Press, Cambridge (2007). https://doi.org/10.1017/CBO9780511618796

  9. Curi, S., Levy, K.Y., Jegelka, S., Krause, A.: Adaptive sampling for stochastic risk-averse learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/0b6ace9e8971cf36f1782aa982a708db-Abstract.html (2020)

  10. Dantzig, G. B.: Discrete-variable extremum problems. Oper. Res. 5(2), 266–288 (1957). https://doi.org/10.1287/opre.5.2.266

  11. Duchi, J.C., Namkoong, H.: Learning models with uniform performance via distributionally robust optimization. arXiv:1810.08750 (2018)

  12. Fan, Y., Lyu, S., Ying, Y., Hu, B.: Learning with average top-k loss. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 497–505. https://proceedings.neurips.cc/paper/2017/hash/6c524f9d5d7027454a783c841250ba71-Abstract.html (2017)

  13. Föllmer, H., Schied, A.: Convex measures of risk and trading constraints. Finance Stoch. 6(4), 429–447 (2002). https://doi.org/10.1007/s007800200072

  14. Guigues, V., Sagastizábal, C.A.: Risk-averse feasible policies for large-scale multistage stochastic linear programs. Math. Program. 138(1-2), 167–198 (2013). https://doi.org/10.1007/s10107-012-0592-1

  15. Hiriart-Urruty, J. B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms. Springer, Heidelberg (1993). Two volumes

  16. Ho-Nguyen, N., Wright, S. J.: Adversarial classification via distributional robustness with Wasserstein ambiguity. arXiv:2005.13815 (2020)

  17. Holstein, K., Vaughan, J. W., Daumé, H. III, Dudík, M., Wallach, H.M.: Improving fairness in machine learning systems: What do industry practitioners need? In: Brewster, S.A., Fitzpatrick, G., Cox, A.L., Kostakos, V. (eds.) Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019, p. 600. ACM. https://doi.org/10.1145/3290605.3300830 (2019)

  18. Howard, R. A., Matheson, J. E.: Risk-sensitive Markov decision processes. Manage. Sci. Theory 18, 356–369 (1972). https://doi.org/10.1287/mnsc.18.7.356

  19. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: Flach, P.A., Bie, T.D., Cristianini, N. (eds.) Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II, Lecture Notes in Computer Science, vol. 7524, pp. 35–50. Springer. https://doi.org/10.1007/978-3-642-33486-3_3 (2012)

  20. Kawaguchi, K., Lu, H.: Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In: Chiappa, S., Calandra, R. (eds.) The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], Proceedings of Machine Learning Research, vol. 108, pp. 669–679. PMLR. http://proceedings.mlr.press/v108/kawaguchi20a.html (2020)

  21. Knight, W.: A self-driving Uber has killed a pedestrian in Arizona. Ethical Tech (2018)

  22. Laguel, Y., Malick, J., Harchaoui, Z.: First-order optimization for superquantile-based supervised learning. In: 30th IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2020, Espoo, Finland, September 21-24, 2020, pp. 1–6. IEEE. https://doi.org/10.1109/MLSP49062.2020.9231909 (2020)

  23. Laguel, Y., Pillutla, K., Malick, J., Harchaoui, Z.: A superquantile approach to federated learning with heterogeneous devices. In: 55th Annual Conference on Information Sciences and Systems, CISS 2021, Baltimore, MD, USA, March 24-26, 2021, pp. 1–6. IEEE. https://doi.org/10.1109/CISS50987.2021.9400318 (2021)

  24. Lee, J., Park, S., Shin, J.: Learning bounds for risk-sensitive learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/9f60ab2b55468f104055b16df8f69e81-Abstract.html (2020)

  25. Levy, D., Carmon, Y., Duchi, J.C., Sidford, A.: Large-scale methods for distributionally robust optimization. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/64986d86a17424eeac96b08a6d519059-Abstract.html (2020)

  26. Luna, J. P., Sagastizábal, C.A., Solodov, M. V.: An approximation scheme for a class of risk-averse stochastic equilibrium problems. Math. Program. 157(2), 451–481 (2016). https://doi.org/10.1007/s10107-016-0988-4

  27. Metz, R.: Microsoft’s neo-Nazi sexbot was a great lesson for makers of AI assistants. Artif. Intell. (2018)

  28. Mhammedi, Z., Guedj, B., Williamson, R.C.: Pac-bayesian bound for the conditional value at risk. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. https://proceedings.neurips.cc/paper/2020/hash/d02e9bdc27a894e882fa0c9055c99722-Abstract.html (2020)

  29. Miranda, S. I.: Superquantile regression: theory, algorithms, and applications. Tech. rep., Naval Postgraduate School, Monterey, CA (2014)

  30. Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., Tanaka, T.: Nonparametric return distribution approximation for reinforcement learning. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 799–806. Omnipress. https://icml.cc/Conferences/2010/papers/652.pdf (2010)

  31. Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017). https://doi.org/10.1007/s10208-015-9296-2

  32. Nesterov, Y. E.: Introductory Lectures on Convex Optimization - A Basic Course, Applied Optimization, vol. 87. Springer, Berlin (2004). https://doi.org/10.1007/978-1-4419-8853-9

  33. Nesterov, Y. E.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5

  34. Nocedal, J., Wright, S. J.: Numerical Optimization. Springer, New York (2006)

  35. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf (2019)

  36. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., VanderPlas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011). http://dl.acm.org/citation.cfm?id=2078195

  37. Pollard, D.: A User’s Guide to Measure Theoretic Probability, vol. 8. Cambridge University Press, Cambridge (2002). https://doi.org/10.1017/CBO9780511811555

  38. Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, vol. 97, pp. 5389–5400. PMLR. http://proceedings.mlr.press/v97/recht19a.html (2019)

  39. Rockafellar, R.T.: Solving stochastic programming problems with risk measures by progressive hedging. Set-Valued Var. Anal. 26(4), 759–768 (2018). https://doi.org/10.1007/s11228-017-0437-4

  40. Rockafellar, R.T., Royset, J.O.: Superquantiles and their applications to risk, random variables, and regression. In: Theory Driven by Influential Applications, pp. 151–167. INFORMS (2013)

  41. Rockafellar, R.T., Royset, J.O.: Random variables, monotone relations, and convex analysis. Math. Program. 148(1-2), 297–331 (2014). https://doi.org/10.1007/s10107-014-0801-1

  42. Rockafellar, R.T., Royset, J.O., Miranda, S.I.: Superquantile regression with applications to buffered reliability, uncertainty quantification, and conditional value-at-risk. Eur. J. Oper. Res. 234(1), 140–154 (2014). https://doi.org/10.1016/j.ejor.2013.10.046

  43. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)

  44. Rockafellar, R.T., Uryasev, S., et al.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)

  45. Rockafellar, R.T., Wets, R. J.B.: Variational Analysis, vol. 317. Springer Science & Business Media, Berlin (2009)

  46. Ruszczynski, A., Shapiro, A.: Optimization of convex risk functions. Math. Oper. Res. 31(3), 433–452 (2006). https://doi.org/10.1287/moor.1050.0186

  47. Sarykalin, S., Serraino, G., Uryasev, S.: Value-at-risk vs. conditional value-at-risk in risk management and optimization. In: State-of-the-art decision-making tools in the information-intensive age, pp. 270–294. Informs (2008)

  48. Shafieezadeh-Abadeh, S., Kuhn, D., Esfahani, P.M.: Regularization via mass transportation. J. Mach. Learn. Res. 20, 103:1–103:68 (2019). http://jmlr.org/papers/v20/17-633.html

  49. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, Cambridge (2014). http://www.cambridge.org/de/academic/subjects/computer-science/pattern-recognition-and-machine-learning/understanding-machine-learning-theory-algorithms

  50. Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming - Modeling and Theory. MOS-SIAM Series on Optimization, 2nd edn., vol. 16 . SIAM, Philadelphia (2014). http://bookstore.siam.org/mo16/

  51. Soma, T., Yoshida, Y.: Statistical learning with conditional value at risk. arXiv:2002.05826 (2020)

  52. Sutton, R.S., Barto, A.G.: Reinforcement Learning. An Introduction. MIT Press, Cambridge (2018)

  53. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Solla, S.A., Leen, T.K., Müller, K. (eds.) Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], pp. 1057–1063. The MIT Press. http://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation (1999)

  54. Tamar, A., Chow, Y., Ghavamzadeh, M., Mannor, S.: Policy gradient for coherent risk measures. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1468–1476. https://proceedings.neurips.cc/paper/2015/hash/024d7f84fff11dd7e8d9c510137a2381-Abstract.html (2015)

  55. Vershynin, R.: High-Dimensional Probability. An Introduction with Applications in Data Science, vol. 47. Cambridge University Press, Cambridge (2018). https://doi.org/10.1017/9781108231596

  56. Wainwright, M. J.: High-Dimensional Statistics. A Non-Asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781108627771

  57. Williamson, R.C., Menon, A.K.: Fairness risk measures. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, vol. 97, pp. 6786–6797. PMLR. http://proceedings.mlr.press/v97/williamson19a.html (2019)


Acknowledgements

We acknowledge support from ANR-19-P3IA-0003 (MIAI - Grenoble Alpes), NSF DMS 2023166, DMS 1839371, CCF 2019844, the CIFAR program “Learning in Machines and Brains”, and faculty research awards.

Author information

Corresponding author

Correspondence to Yassine Laguel.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Theorem 1

In this appendix, we provide a complete proof of Theorem 1. For classical results in this spirit, we refer to the monograph [8]. For discussions of statistical aspects of such learning problems, we refer, e.g., to [11, 24, 28].

The key step in the proof of Theorem 1 is to show the uniform convergence

$$ S^{p}_{n}(w) \to S^{p}(w) \text{ almost surely, uniformly over } w \in W. $$
(38)

Indeed, once we have this, the result immediately follows as

$$ \begin{array}{@{}rcl@{}} 0 \le S^{p}(w_{n}^{\star}) - S^{p}(w^{\star}) &=& S^{p}(w_{n}^{\star}) - S^{p}_{n}(w_{n}^{\star}) + S^{p}_{n}(w_{n}^{\star}) - S^{p}_{n}(w^{\star}) + S^{p}_{n}(w^{\star}) - S^{p}(w^{\star})\\ &\le& 2 \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| \to 0, \end{array} $$

where we use \(S^{p}_{n}(w_{n}^{\star }) \le S^{p}_{n}(w^{\star })\) in the second inequality.

In order to prove (38), we use the variational expression of the superquantile (4). We define

$$ \bar{S}^{p}(w, \eta) = \eta + \frac{1}{1-p} \mathbb{E}_{(x, y)\sim P} [\max(\ell(y, \varphi(w, x))-\eta,0)] , $$

so that, using that the loss is bounded by B, we can write

$$ S^{p}(w) = \underset{\eta \in [0, B]}{\min} \bar{S}^{p}(w, \eta). $$

We define the analogous empirical version \(\bar {S}^{p}_{n}(w, \eta )\) so that \(S^{p}_{n}(w) = \min \limits _{\eta \in [0, B]} \bar {S}^{p}_{n}(w, \eta )\). Note that \(\bar {S}^{p}_{n}(w, \eta )\) is measurable for each fixed (w, η) and \(S^{p}_{n}(w)\) is measurable for each fixed w.
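For concreteness, writing (x1, y1), …, (xn, yn) for the i.i.d. sample drawn from P, this empirical version reads

$$ \bar{S}^{p}_{n}(w, \eta) = \eta + \frac{1}{n(1-p)} \sum\limits_{i=1}^{n} \max\big(\ell(y_{i}, \varphi(w, x_{i}))-\eta, 0\big) . $$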

Claim 1

Under Assumption 2, the random variable

$$ \delta_{n}(w, \eta) := \bar{S}^{p}_{n}(w, \eta) - \bar{S}^{p}(w, \eta) $$

has mean zero, lies almost surely in [−B, B], and satisfies

$$ |\delta_{n}(w, \eta) - \delta_{n}(w^{\prime}, \eta^{\prime})| \le 2M/(1-p) \text{dist}_{\varphi}(w, w^{\prime}) + 2(1+1/(1-p)) |\eta - \eta^{\prime}| . $$
(39)

Note first that \(\mathbb {E}[\bar {S}^{p}_{n}(w, \eta )] = \bar {S}^{p}(w, \eta )\) and that the boundedness of δn comes from the boundedness of the loss function. The Lipschitz property of δn is likewise inherited from the loss function, as follows. Using that \(\max \limits \{\cdot , 0\}\) is 1-Lipschitz and that the loss is M-Lipschitz, we get

$$ \begin{array}{@{}rcl@{}} |\max\{\ell(y, \varphi(w, x)) - \eta, 0\} &-& \max\{\ell(y, \varphi(w^{\prime}, x)) - \eta^{\prime}, 0\}| \\ &&\le |\ell(y, \varphi(w, x)) - \ell(y, \varphi(w^{\prime}, x))| + |\eta - \eta^{\prime}| \\ &&\le M \|\varphi(w, x) - \varphi(w^{\prime}, x)\| + |\eta - \eta^{\prime}| \\ &&\le M\text{dist}_{\varphi}(w, w^{\prime}) + |\eta - \eta^{\prime}| . \end{array} $$

Inequality (39) then simply follows from the triangle inequality, and Claim 1 is proved.

The next step in the proof is, for a given ε > 0

  • to construct a cover T of W × [0, B], and then

  • to control the convergence over the points of T, more precisely to control the probability of the event

    $$ E_{n}(\varepsilon) = \underset{(w, \eta) \in T}{\bigcap} \left\{ \delta_{n}(w, \eta) \le \varepsilon / 2 \right\} . $$

First, using Assumption 1, we consider T1, an (ε(1 − p)/(8M))-cover of W with respect to distφ. We also consider T2, a uniform discretization of the line segment [0, B] of width ε/(8(1 + 1/(1 − p))). We can then introduce the cover of W × [0, B]

$$ T = T_{1} \times T_{2} \subset W \times [0, B]. $$

Since |T2| = 8B(1 + 1/(1 − p))/ε, we have that |T| = (8B(1 + 1/(1 − p))/ε)N(ε(1 − p)/(8M)). Note that the event \(\left \{ \delta _{n}(w, \eta ) \le \varepsilon / 2 \right \}\) is measurable for fixed (w, η) since δn(w, η) is measurable; therefore, En(ε) is measurable as a finite intersection of measurable events.

To get uniform convergence, it is sufficient to control what happens at the points of T. Indeed, for any (w, η), there exists a point \((w^{\prime }, \eta ^{\prime }) \in T\) such that \(\text {dist}_{\varphi }(w, w^{\prime }) \le \varepsilon (1-p)/(8M)\) and \(|\eta - \eta ^{\prime }| \le \varepsilon /(8(1+1/(1-p)))\). As a consequence, if the event En(ε) holds, then

$$ \begin{array}{@{}rcl@{}} \delta_{n}(w, \eta) &\le& \delta_{n}(w^{\prime}, \eta^{\prime}) + |\delta_{n}(w, \eta) - \delta_{n}(w^{\prime}, \eta^{\prime})| \\ &\overset{(39)}{\le}& \delta_{n}(w^{\prime}, \eta^{\prime}) + 2M/(1-p) \text{dist}_{\varphi}(w, w^{\prime}) + 2(1+1/(1-p)) | \eta - \eta^{\prime}|\\ &\le& \frac{\varepsilon}{2} + \frac{\varepsilon}{4} + \frac{\varepsilon}{4} = \varepsilon. \end{array} $$

This implies that the events of interest are included in \(\overline {E}_{n}(\varepsilon )\), the complement of En(ε); indeed, we have

$$ \left\{\underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right\} \subset \left\{\underset{(w,\eta) \in W \times [0, B]}{\sup} \delta_{n}(w, \eta) >\varepsilon\right\} \subset \overline E_{n}(\varepsilon) . $$

Postponing the proof of measurability of these events to Claim 3 at the end of this proof, we have the following bound on the sum of probabilities

$$ \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right) \le \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \overline E_{n}(\varepsilon)\right). $$
(40)

Claim 2

The probabilities of the complements of En(ε) are summable, i.e.,

$$ \sum\limits_{n=1}^{\infty} \mathbb{P}\left( \overline E_{n}(\varepsilon)\right) < \infty . $$

This is a direct application of Hoeffding's inequality (see, e.g., [55, Theorem 2.2.2]), as follows. For any fixed (w, η) ∈ W × [0, B], Hoeffding's inequality gives

$$ \mathbb{P}(|\delta_{n}(w, \eta)| > \varepsilon/2) \le 2 \exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) . $$

Applied to all (w, η) ∈ T, together with a union bound over the |T| points of the cover, this yields

$$ \mathbb{P}\big(\overline E_{n}(\varepsilon)\big) \le 2|T|\exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) = \frac{16B(1+1/(1-p))}{\varepsilon} N\left( \frac{\varepsilon(1-p)}{8M}\right) \exp\left( - \frac{n\varepsilon^{2}}{2B^{2}}\right) . $$

Since the right-hand side is summable in n, this proves Claim 2.

We conclude on the uniform convergence (38) with the Borel-Cantelli lemma, by the classical rationale (see, e.g., the textbook [37, Chap. 2, Sec. 6]): the bound (40) and Claim 2 give that the probabilities are summable for any ε > 0; applying Borel-Cantelli along the sequence εk = 1/k then gives the uniform convergence (38), which completes the proof of the theorem.

Finally, it remains to show measurability of some events of interest.

Claim 3

The following events are measurable for each ε > 0:

$$ \begin{array}{@{}rcl@{}} E^{\prime}_{n}(\varepsilon) &:=& \left\{\underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| > \varepsilon\right\} , \\ E^{\prime\prime}_{n}(\varepsilon) &:=& \left\{\underset{(w,\eta) \in W \times [0, B]}{\sup} \delta_{n}(w, \eta) >\varepsilon\right\} . \end{array} $$

We prove the claim for \(E^{\prime }_{n}(\varepsilon )\); the proof for \(E^{\prime \prime }_{n}(\varepsilon )\) is entirely analogous. Since the set \(\mathbb {Q}^{d}\) of d-dimensional rationals is dense in \(\mathbb {R}^{d}\) and the map \(w \mapsto |S^{p}_{n}(w) - S^{p}(w)|\) is continuous, we have that

$$ \underset{w \in W}{\sup} |S^{p}_{n}(w) - S^{p}(w)| = \underset{w \in W \cap \mathbb{Q}^{d}}{\sup} |S^{p}_{n}(w) - S^{p}(w)| . $$

Since the latter term is a supremum over a countable set of measurable random variables, we get that \(E^{\prime }_{n}(\varepsilon )\) is measurable.

Appendix B: Numerical Illustrations

We provide simple illustrations of the interest of using superquantiles in machine learning. More precisely, we reproduce the experimental framework of [9] and solve the superquantile optimization problems with the approach presented here, combining smoothing and quasi-Newton updates. For additional experiments with other datasets, metrics, and contexts, we refer to [9].

We consider two basic machine learning tasks (regression and classification) with linear prediction functions φ(w, x) = w⊤x and two standard datasets from the UCI Machine Learning repository. Denoting such a dataset by Pn = {(xi, yi)}1 ≤ i ≤ n, we introduce the (regularized) empirical risk minimization problem

$$ \underset{w \in \mathbb{R}^{d}}{\min} ~~\mathbb{E}_{(x,y)\sim P_{n}}\left[\ell(y, {w}^{\top} x)\right] +\frac{1}{2n}\|w\|^{2} , $$

and its smoothed superquantile analogue

$$ \underset{w \in \mathbb{R}^{d}}{\min} ~~{[\mathbb{S}^{\nu}_{p}]}_{(x,y)\sim P_{n}}\left[\ell(y, {w}^{\top} x) \right] +\frac{1}{2n}\|w\|^{2}. $$

We solve these problems using L-BFGS via the toolbox SPQR [22], which offers a simple user interface and implements the needed oracles (with the Euclidean smoothing of Example 5 for the smoothed approximation).
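To fix ideas, here is a minimal sketch of such a pipeline, independent of SPQR: a smoothed superquantile oracle obtained by maximizing q⊤u − (ν/2)‖q − 1/n‖² over the capped simplex {q ≥ 0, Σi qi = 1, qi ≤ 1/(n(1−p))}, combined with SciPy's L-BFGS-B on a least-squares objective. The penalty scaling, the helper names, and the synthetic data are illustrative assumptions and may differ from the exact constants of Example 5.

```python
import numpy as np
from scipy.optimize import minimize

def capped_simplex_projection(z, cap, n_iter=60):
    """Euclidean projection of z onto {q : 0 <= q_i <= cap, sum(q) = 1},
    by bisection on the shift lam in q_i = clip(z_i - lam, 0, cap)."""
    lo, hi = z.min() - cap, z.max()              # sum >= 1 at lo, sum <= 1 at hi
    for _ in range(n_iter):
        lam = 0.5 * (lo + hi)
        if np.clip(z - lam, 0.0, cap).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    return np.clip(z - 0.5 * (lo + hi), 0.0, cap)

def smoothed_superquantile(losses, p, nu):
    """Value of the smoothed superquantile of `losses` and its gradient with
    respect to `losses` (the maximizing dual vector q, by Danskin's theorem)."""
    n = losses.size
    cap = 1.0 / (n * (1.0 - p))
    q = capped_simplex_projection(np.full(n, 1.0 / n) + losses / nu, cap)
    value = q @ losses - 0.5 * nu * np.sum((q - 1.0 / n) ** 2)
    return value, q

def objective(w, X, y, p, nu, reg):
    """Smoothed superquantile of the pointwise least-squares losses, plus the
    ridge term reg/2 * ||w||^2; returns (value, gradient)."""
    residuals = X @ w - y
    value, q = smoothed_superquantile(0.5 * residuals ** 2, p, nu)
    grad = X.T @ (q * residuals) + reg * w       # chain rule through the losses
    return value + 0.5 * reg * np.sum(w ** 2), grad

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
res = minimize(objective, np.zeros(5), args=(X, y, 0.98, 0.1, 1.0 / 200),
               jac=True, method="L-BFGS-B")
print(res.fun, res.x)
```

The projection onto the capped simplex is computed by a simple bisection on the shift parameter, which is enough for illustration purposes; the maximizing vector q directly provides the gradient of the smoothed superquantile with respect to the losses.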

Regression and Least-Squares

We consider a regularized least-squares regression on the Abalone dataset from the UCI Machine Learning repository. We perform an 80%/20% train-test split of the dataset and minimize the least-squares loss on the training set, both in expectation and with respect to the superquantile (with p = 0.98 and ν = 0.1).

We report in Fig. 7 the distribution of errors \(|y_{i} - {w}^{\top } x_{i}|\) on the testing dataset for both models w (standard in blue and superquantile in red). We observe that the superquantile model exhibits a thinner upper tail than the risk-neutral model, which is quantified by the leftward shift of the 0.98 quantile. This comes at the price of a lower performance in expectation than the model trained on the expected loss, which is clearly visible in the figure and quantified by the rightward shift of the mean.

Fig. 7

Regression: histogram of the regression errors on the testing dataset for the model learned by the superquantile approach (red), compared to the one learned by classical empirical risk minimization (violet). We see a reshaping of the histogram of errors and a gain on worst-case errors

Classification and Logistic Regression

We consider a logistic regression on the Australian Credit dataset. We randomly split the dataset with an 80%/20% train-test split for 5 different seeds. For each seed, we perform a pessimistic distributional shift on the training dataset by downsampling the majority class (similarly to what is done in [9, Sec. 5.2]). More precisely, we remove a large, randomly selected fraction of the majority class, so that it afterwards amounts to only 10% of the size of the minority class. We then tune the safety-level parameter p by k-fold cross-validation on the shifted dataset and select the value yielding the best validation accuracy; the grid used for this parameter is [0.8, 0.85, 0.9, 0.95, 0.99] (see the sketch below). We finally compute the testing accuracy and the testing precision with the selected parameter.
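For illustration, here is a minimal sketch of the downsampling step and of the grid over p; the helper name and its interface are hypothetical and not those of [9] or SPQR.

```python
import numpy as np

def pessimistic_shift(X, y, rng, ratio=0.10):
    """Downsample the majority class of a binary dataset (X, y) so that it
    amounts to `ratio` times the size of the minority class."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    maj_idx = np.flatnonzero(y == majority)
    n_keep = max(1, int(ratio * np.sum(y == minority)))
    keep = np.concatenate([np.flatnonzero(y == minority),
                           rng.choice(maj_idx, size=n_keep, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

p_grid = [0.8, 0.85, 0.9, 0.95, 0.99]
# For each seed: shift the training set with pessimistic_shift, run k-fold
# cross-validation over p_grid on the shifted set, keep the p with the best
# validation accuracy, then report test accuracy and precision.
```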

We report in the table of Fig. 8 the testing accuracy and the testing precision averaged over the 5 different seeds, together with the associated standard deviations. We observe that the superquantile model brings better performance than the standard model, both in terms of accuracy and precision.

Fig. 8

Classification: better testing accuracy and precision for the superquantile approach, in the case of distributional shifts

About this article

Cite this article

Laguel, Y., Pillutla, K., Malick, J. et al. Superquantiles at Work: Machine Learning Applications and Efficient Subgradient Computation. Set-Valued Var. Anal 29, 967–996 (2021). https://doi.org/10.1007/s11228-021-00609-w
