
Relative deviation learning bounds and generalization with unbounded loss functions


Abstract

We present an extensive analysis of relative deviation bounds, including detailed proofs of two-sided inequalities and their implications. We also give detailed proofs of two-sided generalization bounds that hold in the general case of unbounded loss functions, under the assumption that a moment of the loss is bounded. We then illustrate how to apply these results in a sample application: the analysis of importance weighting.
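In the importance weighting application, the learner receives a sample drawn from a source distribution Q but seeks to minimize the loss under a target distribution P, so each loss is reweighted by w(z) = P(z)/Q(z); since these weights are unbounded in general, generalization bounds requiring only a bounded moment of the (weighted) loss are the natural tool. The following minimal sketch is our own illustration of an importance-weighted empirical loss (the Gaussian source/target pair and the squared loss are assumptions made for the example, not taken from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed toy setup: sample from Q, estimate the expected loss under P.
Q = stats.norm(loc=0.0, scale=1.0)   # source (sampling) distribution
P = stats.norm(loc=0.5, scale=1.0)   # target distribution

def squared_loss(h, z):
    """Loss of a constant predictor h on point z (illustrative choice)."""
    return (h - z) ** 2

z = Q.rvs(size=10_000, random_state=rng)
w = P.pdf(z) / Q.pdf(z)              # importance weights; unbounded in general
h = 0.3

weighted_emp_loss = np.mean(w * squared_loss(h, z))                              # reweighted empirical loss
target_loss = np.mean(squared_loss(h, P.rvs(size=100_000, random_state=rng)))    # direct Monte Carlo under P

print(f"importance-weighted estimate: {weighted_emp_loss:.3f}")
print(f"direct estimate under P:      {target_loss:.3f}")
```

How tightly the reweighted empirical loss concentrates around the target loss is governed by a moment of the weighted loss, which is the type of quantity controlled by the bounds studied in this paper.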



Acknowledgments

We thank the reviewers for several careful and very useful comments. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

Author information


Corresponding author

Correspondence to Spencer Greenberg.



Appendix

Lemmas in Support of Section 3

Lemma 4

Let 1 < α ≤ 2 and for any η > 0, let \(f\colon (0, +\infty ) \times (0, +\infty ) \to \mathbb {R}\) be the function defined by \(f\colon (x, y) \mapsto \frac {x - y}{\sqrt [\alpha ]{x + y + \eta }}\). Then, f is a strictly increasing function of x and a strictly decreasing function of y.

Proof

f is differentiable over its domain of definition and for all \((x, y) \in (0, +\infty ) \times (0, +\infty )\),

$$\begin{array}{@{}rcl@{}} && \frac{\partial f}{\partial x}(x, y) = \frac{(x + y + \eta)^{\frac{1}{\alpha}} - \frac{x - y}{\alpha} (x + y + \eta)^{\frac{1}{\alpha} - 1}}{(x + y + \eta)^{\frac{2}{\alpha}}} = \frac{\frac{\alpha - 1}{\alpha} x + \frac{\alpha + 1}{\alpha} y + \eta}{(x + y + \eta)^{1 + \frac{1}{\alpha}}} > 0\\ && \frac{\partial f}{\partial y}(x, y) = \frac{- (x + y + \eta)^{\frac{1}{\alpha}} - \frac{x - y}{\alpha} (x + y + \eta)^{\frac{1}{\alpha} - 1}}{(x + y + \eta)^{\frac{2}{\alpha}}} = -\frac{\frac{\alpha + 1}{\alpha} x + \frac{\alpha - 1}{\alpha} y + \eta}{(x + y + \eta)^{1 + \frac{1}{\alpha}}} < 0. \end{array} $$
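A quick numerical sanity check of the lemma (an illustration we add; the particular values of α, η and the grids are arbitrary):

```python
import numpy as np

def f(x, y, alpha=1.5, eta=0.1):
    # f(x, y) = (x - y) / (x + y + eta)^(1/alpha), as in Lemma 4
    return (x - y) / (x + y + eta) ** (1.0 / alpha)

xs = np.linspace(0.01, 5.0, 200)
ys = np.linspace(0.01, 5.0, 200)

assert np.all(np.diff(f(xs, 1.0)) > 0)   # strictly increasing in x (y fixed)
assert np.all(np.diff(f(1.0, ys)) < 0)   # strictly decreasing in y (x fixed)
```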

Proofs in Support of Section 4

3.1 Proof of Theorem 3

Proof

We prove the first statement. The second statement can be shown in a very similar way. Fix 1 < α ≤ 2, 𝜖 > 0, and a sample \(\mathcal {S}\), and assume that for any h ∈ H and t ≥ 0, the following holds:

$$ \frac{\Pr[L(h, z) > t] - \widehat{\Pr}[L(h, z) > t] }{\sqrt[\alpha]{\Pr[L(h, z) > t] + \tau}} \leq \epsilon. $$
(17)

We show that this implies that for any h ∈ H, \(\frac {\mathcal {L}(h) - \widehat {\mathcal {L}}_{S}(h)}{\sqrt [\alpha ]{\mathcal {L}_{\alpha }(h) + \tau }} \leq {\Gamma }(\alpha , \epsilon ) \epsilon \). By the properties of the Lebesgue integral, we can write

$$\begin{array}{@{}rcl@{}} && \mathcal{L}(h) = \mathrm{E}_{z \sim D}[L(h, z)] = {\int}_{0}^{+\infty} \Pr[L(h, z) > t] dt \\ && \widehat{\mathcal{L}}(h) = \mathrm{E}_{z \sim \widehat{D}}[L(h, z)] = {\int}_{0}^{+\infty} \widehat{\Pr}[L(h, z) > t] dt, \end{array} $$

and, similarly,

$$\mathcal{L}_{\alpha}(h) = \mathrm{E}_{z \sim D}[L^{\alpha}(h, z)] = {\int}_{0}^{+\infty} \Pr[L^{\alpha}(h, z) > t] dt = {\int}_{0}^{+\infty} \alpha t^{\alpha - 1} \Pr[L(h, z) > t] dt. $$
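Both are the standard tail-integral identities for nonnegative random variables; the second follows from the first applied to \(L^{\alpha}(h, z)\) together with the change of variable \(t = u^{\alpha}\). As a small added sanity check with a concrete distribution (an Exp(1) loss, chosen purely for illustration):

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

alpha = 1.5
tail = lambda t: np.exp(-t)   # Pr[L > t] for an Exp(1) "loss", an illustrative choice

# E[L] = int_0^inf Pr[L > t] dt
print(integrate.quad(tail, 0, np.inf)[0], "vs E[L] =", 1.0)

# E[L^alpha] = int_0^inf alpha t^(alpha-1) Pr[L > t] dt
val = integrate.quad(lambda t: alpha * t ** (alpha - 1) * tail(t), 0, np.inf)[0]
print(val, "vs E[L^alpha] =", gamma(1 + alpha))   # both ~ 1.329
```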

In what follows, we use the notation \(I_{\alpha } = \mathcal {L}_{\alpha }(h) + \tau \). Let \(t_{0} = s I_{\alpha }^{\frac {1}{\alpha }}\) and \(t_{1} = t_{0} \left [ \frac {1}{\epsilon } \right ]^{\frac {1}{\alpha - 1}}\) for s > 0. To bound \(\mathcal {L}(h) - \widehat {\mathcal {L}}(h)\), we simply bound \(\Pr [L(h, z) > t] - \widehat {\Pr }[L(h, z) > t]\) by \(\Pr [L(h, z) > t]\) for large values of t, that is t > t1, and use inequality (17) for smaller values of t:

$$\begin{array}{@{}rcl@{}} \mathcal{L}(h) - \widehat{\mathcal{L}}(h) &=& {\int}_{0}^{+\infty} \Pr[L(h, z) > t] - \widehat{\Pr}[L(h, z) > t] dt\\ & \leq& {\int}_{0}^{t_{1}} \epsilon \sqrt[\alpha]{\Pr[L(h, z) > t] + \tau} dt + {\int}_{t_{1}}^{+\infty} \Pr[L(h, z) > t] dt. \end{array} $$

For relatively small values of t, \(\Pr [L(h, z) > t]\) is close to one, so for t ≤ t0 we simply bound it by one. Thus, we can write

$$\begin{array}{@{}rcl@{}} \mathcal{L}(h) - \widehat{\mathcal{L}}(h) &\!\leq\!& {\int}_{0}^{t_{0}} \epsilon \sqrt[\alpha]{1 + \tau} dt + {\int}_{t_{0}}^{t_{1}} \epsilon \sqrt[\alpha]{\Pr[L(h, z) \!>\! t] + \tau} dt + {\int}_{t_{1}}^{+\infty} \Pr[L(h, z) \!>\! t]dt \\ &=& {\int}_{0}^{+\infty} f(t) g(t) dt, \end{array} $$

with

$$f(t) = \left\{\begin{array}{ll} \gamma_{1} I_{\alpha}^{\frac{\alpha - 1}{\alpha^{2}}} \epsilon \sqrt[\alpha]{1 + \tau} & \text{if } 0 \leq t \leq t_{0}\\ \gamma_{2} \left[ \alpha t^{\alpha - 1} (\Pr[L(h, z) > t] + \tau) \right]^{\frac{1}{\alpha}} \epsilon & \text{if } t_{0}< t \leq t_{1}\\ \gamma_{2} \left[ \alpha t^{\alpha - 1} \Pr[L(h, z) > t] \right]^{\frac{1}{\alpha}} \epsilon & \text{if } t_{1} < t. \end{array}\right. \quad g(t) = \left\{\begin{array}{ll} \frac{1}{\gamma_{1} I_{\alpha}^{\frac{\alpha - 1}{\alpha^{2}}}} & \text{if } 0 \leq t \leq t_{0}\\ \frac{1}{\gamma_{2} (\alpha t^{\alpha - 1})^{\frac{1}{\alpha}}} & \text{if } t_{0} < t \leq t_{1}\\ \frac{\Pr[L(h, z) > t]^{\frac{\alpha - 1}{\alpha}}}{\gamma_{2} (\alpha t^{\alpha - 1})^{\frac{1}{\alpha}}} \frac{1}{\epsilon} & \text{if } t_{1} < t, \end{array}\right. $$

where γ1,γ2 are positive parameters that we shall select later. Now, since α > 1, by Hölder’s inequality,

$$\begin{array}{@{}rcl@{}} \mathcal{L}(h) - \widehat{\mathcal{L}}(h) & \leq& \left[ {\int}_{0}^{+\infty} f(t)^{\alpha} dt \right]^{\frac{1}{\alpha}} \left[ {\int}_{0}^{+\infty} g(t)^{\frac{\alpha}{\alpha - 1}} dt \right]^{\frac{\alpha - 1}{\alpha}}. \end{array} $$
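Note that the product f(t)g(t) indeed reproduces each integrand of the decomposition above; for instance, on the middle region \((t_{0}, t_{1}]\),

$$ f(t) g(t) = \gamma_{2} \left[ \alpha t^{\alpha - 1} (\Pr[L(h, z) > t] + \tau) \right]^{\frac{1}{\alpha}} \epsilon \cdot \frac{1}{\gamma_{2} (\alpha t^{\alpha - 1})^{\frac{1}{\alpha}}} = \epsilon \sqrt[\alpha]{\Pr[L(h, z) > t] + \tau}, $$

and similarly on the other two regions.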

The first integral on the right-hand side can be bounded as follows:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} f(t)^{\alpha} dt & = & {\int}_{0}^{t_{0}} (1\! +\! \tau) (\gamma_{1} I_{\alpha}^{\frac{\alpha - 1}{\alpha^{2}}} \epsilon)^{\alpha} dt + \gamma_{2}^{\alpha} \epsilon^{\alpha} \tau {\int}_{t_{0}}^{t_{1}} \alpha t^{\alpha - 1} dt + \gamma_{2}^{\alpha} {\int}_{t_{0}}^{+\infty} \alpha t^{\alpha - 1} \Pr[L(h, z) > t] \epsilon^{\alpha} dt\\ & \!\leq\!& (1 + \tau) \gamma_{1}^{\alpha} I_{\alpha}^{\frac{\alpha - 1}{\alpha}} t_{0} \epsilon^{\alpha} + \gamma_{2}^{\alpha} \epsilon^{\alpha} \tau (t_{1}^{\alpha} - t_{0}^{\alpha})+ \gamma_{2}^{\alpha} \epsilon^{\alpha} I_{\alpha} \\ & \!\leq\!& (\gamma_{1}^{\alpha} (1 + \tau) s + \gamma_{2}^{\alpha} (1 + s^{\alpha} (1/\epsilon)^{\frac{\alpha}{\alpha - 1}} \tau) ) \epsilon^{\alpha} I_{\alpha}\\ & \!\leq\!& (\gamma_{1}^{\alpha} (1 + \tau) s + \gamma_{2}^{\alpha} (1 + s^{\alpha} \tau^{\frac{1}{\alpha}})) \epsilon^{\alpha} I_{\alpha}. \end{array} $$

Since \(t_{1}/t_{0} = (1/\epsilon )^{\frac {1}{\alpha - 1}}\), the second integral can be computed and bounded as follows:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} g(t)^{\frac{\alpha}{\alpha - 1}} dt & =& {\int}_{0}^{t_{0}} \frac{dt}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}} I_{\alpha}^{\frac{1}{\alpha}}} + {\int}_{t_{0}}^{t_{1}} \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} \alpha^{\frac{1}{\alpha - 1}} } \frac{dt}{t} + {\int}_{t_{1}}^{+\infty} \frac{\Pr[L(h, z) > t]}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} \alpha^{\frac{1}{\alpha - 1}} \epsilon^{\frac{\alpha}{\alpha - 1}} t} dt\\ & = &\frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + {\int}_{t_{1}}^{+\infty} \frac{\alpha t^{\alpha - 1} \Pr[L(h, z) > t]}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha \epsilon)^{\frac{\alpha}{\alpha - 1}} t^{\alpha} } dt\\ & \leq &\frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + {\int}_{t_{1}}^{+\infty} \frac{\alpha t^{\alpha - 1} \Pr[L(h, z) > t]}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha \epsilon)^{\frac{\alpha}{\alpha - 1}} t_{1}^{\alpha}} dt\\ & \leq& \frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{I_{\alpha}}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}} (\alpha \epsilon)^{\frac{\alpha}{\alpha - 1}} s^{\alpha} I_{\alpha} (\frac{1}{\epsilon})^{\frac{\alpha}{\alpha - 1}}} \\ & =& \frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}}} \left( \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right). \end{array} $$

Combining the bounds obtained for these integrals yields directly

$$\begin{array}{@{}rcl@{}} &&\mathcal{L}(h) - \widehat{\mathcal{L}}(h) \\ & \leq& \left[ (\gamma_{1}^{\alpha} (1 + \tau) s + \gamma_{2}^{\alpha} (1 + s^{\alpha} \tau^{\frac{1}{\alpha}})) \epsilon^{\alpha} I_{\alpha} \right]^{\frac{1}{\alpha}} \left[ \frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}}} \left( \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right) \right]^{\frac{\alpha - 1}{\alpha}} \\ & =& (\gamma_{1}^{\alpha} (1 + \tau) s + \gamma_{2}^{\alpha} (1 + s^{\alpha} \tau^{\frac{1}{\alpha}}))^{\frac{1}{\alpha}} \left[ \frac{s}{\gamma_{1}^{\frac{\alpha}{\alpha - 1}}} + \frac{1}{\gamma_{2}^{\frac{\alpha}{\alpha - 1}}} \left( \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right) \right]^{\frac{\alpha - 1}{\alpha}} \epsilon I_{\alpha}^{\frac{1}{\alpha}}. \end{array} $$

Observe that the expression on the right-hand side can be rewritten as \(\| \mathbf {u} \|_{\alpha } \| \mathbf {v} \|_{\frac {\alpha }{\alpha - 1}} \ \epsilon I_{\alpha }^{\frac {1}{\alpha }}\) where the vectors u and v are defined by \(\mathbf {u} = (\gamma _{1} (1 + \tau )^{\frac {1}{\alpha }} s^{\frac {1}{\alpha }}, \gamma _{2} (1 + s^{\alpha } \tau ^{\frac {1}{\alpha }})^{\frac {1}{\alpha }})\) and \(\mathbf {v} = (v_{1}, v_{2}) = \left (\frac {s^{\frac {\alpha - 1}{\alpha }}}{\gamma _{1}}, \frac {1}{\gamma _{2}} \left [ \frac {1}{(\alpha - 1) \alpha ^{\frac {1}{\alpha - 1}} } \log \frac {1}{\epsilon } + \frac {1}{\alpha ^{\frac {\alpha }{\alpha - 1}} s^{\alpha } } \right ]^{\frac {\alpha - 1}{\alpha }} \right )\). The inner product uv does not depend on γ1 and γ2, and equality holds if and only if the vectors u and \(\mathbf {v}^{\prime } = (v_{1}^{\frac {1}{\alpha - 1}}, v_{2}^{\frac {1}{\alpha - 1}})\) are collinear (as we can see by applying Hölder’s inequality).

γ1 and γ2 can be chosen so that \({\det }(\mathbf {u}, \mathbf {v}^{\prime }) = 0\), since this condition can be rewritten as

$$ s^{\frac{1}{\alpha}} (1 + \tau)^{\frac{1}{\alpha}} \frac{\gamma_{1}}{\gamma_{2}^{\frac{1}{\alpha - 1}}} \left[ \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right]^{\frac{1}{\alpha}} - s^{\frac{1}{\alpha}} (1 + s^{\alpha} \tau^{\frac{1}{\alpha}})^{\frac{1}{\alpha}} \frac{\gamma_{2}}{\gamma_{1}^{\frac{1}{\alpha - 1}}} = 0, $$
(18)

or equivalently,

$$ \left( \frac{\gamma_{1}}{\gamma_{2}} \right)^{\frac{\alpha}{\alpha - 1}} \left[ \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right]^{\frac{1}{\alpha}} - (1 + s^{\alpha} \tau^{\frac{1}{\alpha}})^{\frac{1}{\alpha}} = 0. $$
(19)

Thus, for such values of γ1 and γ2, the following inequality holds:

$$\begin{array}{@{}rcl@{}} \mathcal{L}(h) - \widehat{\mathcal{L}}(h) & \leq (\mathbf{u} \cdot \mathbf{v}) \epsilon I_{\alpha}^{\frac{1}{\alpha}} = f(s) \epsilon I_{\alpha}^{\frac{1}{\alpha}}, \end{array} $$

with

$$\begin{array}{@{}rcl@{}} f(s) & =& (1 + \tau)^{\frac{1}{\alpha}} s + (1 + s^{\alpha} \tau^{\frac{1}{\alpha}})^{\frac{1}{\alpha}} \left[ \frac{1}{(\alpha - 1) \alpha^{\frac{1}{\alpha - 1}} } \log \frac{1}{\epsilon} + \frac{1}{\alpha^{\frac{\alpha}{\alpha - 1}} s^{\alpha} } \right]^{\frac{\alpha - 1}{\alpha}} \\ & =& (1 + \tau)^{\frac{1}{\alpha}} s + \frac{(1 + s^{\alpha} \tau^{\frac{1}{\alpha}})^{\frac{1}{\alpha}}}{\alpha} \left[ \frac{\alpha}{(\alpha - 1) } \log \frac{1}{\epsilon} + \frac{1}{s^{\alpha} } \right]^{\frac{\alpha - 1}{\alpha}}. \end{array} $$

Setting \(s = \frac {\alpha - 1}{\alpha }\) yields the statement of the theorem. □
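To give a concrete sense of the resulting constant (an added illustration, taking α = 2 and neglecting the τ terms), the factor \(f(\frac{\alpha - 1}{\alpha})\) multiplying \(\epsilon I_{\alpha}^{\frac{1}{\alpha}}\) becomes

$$ f\left( \frac{1}{2} \right) = \frac{1}{2} + \frac{1}{2} \left[ 2 \log \frac{1}{\epsilon} + 4 \right]^{\frac{1}{2}} = \frac{1}{2} + \sqrt{1 + \frac{1}{2} \log \frac{1}{\epsilon}}, $$

so that in this regime the deviation bound scales as \(\epsilon \sqrt{\log (1/\epsilon)}\, \mathcal{L}_{2}(h)^{\frac{1}{2}}\) up to constants.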

3.2 Proof of Proposition 1

Proof

We prove the first inequality. The second can be proven in a very similar way. Fix α > 2 and h ∈ H. As in the proof of Theorem 3, we bound \(\Pr [L(h, z) > t]\) by 1 for t close to 0, say t ≤ t0 for some t0 > 0 that we shall later determine. We can write

$${\int}_{0}^{+\infty} \sqrt{\Pr[L(h, z) > t]} dt \leq {\int}_{0}^{t_{0}} 1 dt + {\int}_{t_{0}}^{+\infty} \sqrt{\Pr[L(h, z) > t]} dt = {\int}_{0}^{+\infty} f(t) g(t) dt, $$

with functions f and g defined as follows:

$$f(t) = \left\{\begin{array}{ll} \gamma I_{\alpha}^{\frac{\alpha - 1}{2 \alpha}} & \text{if } 0 \leq t \leq t_{0}\\ \alpha^{\frac{1}{2}} t^{\frac{\alpha - 1}{2}} \Pr[L(h, z) > t]^{\frac{1}{2}} & \text{if } t_{0} < t. \end{array}\right. \quad g(t) = \left\{\begin{array}{ll} \frac{1}{\gamma I_{\alpha}^{\frac{\alpha - 1}{2 \alpha}}} & \text{if } 0 \leq t \leq t_{0}\\ \frac{1}{\alpha^{\frac{1}{2}} t^{\frac{\alpha - 1}{2}}} & \text{if } t_{0} < t, \end{array}\right. $$

where \(I_{\alpha } = \mathcal {L}_{\alpha }(h)\) and where γ is a positive parameter that we shall select later. By the Cauchy-Schwarz inequality,

$${\int}_{0}^{+\infty} \sqrt{\Pr[L(h, z) > t]} dt \leq \left( {\int}_{0}^{+\infty} f(t)^{2} dt \right)^{\frac{1}{2}} \left( {\int}_{0}^{+\infty} g(t)^{2} dt \right)^{\frac{1}{2}}. $$

Thus, we can write

$$\begin{array}{@{}rcl@{}} &&{\int}_{0}^{+\infty} \sqrt{\Pr[L(h, z) > t]} dt \\ &\leq& \left( \gamma^{2} I_{\alpha}^{\frac{\alpha - 1}{\alpha}} t_{0} + {\int}_{t_{0}}^{+\infty} \alpha t^{\alpha - 1} \Pr[L(h, z) > t] dt \right)^{\frac{1}{2}} \left( \frac{t_{0}}{\gamma^{2} I_{\alpha}^{\frac{\alpha - 1}{\alpha}}} + {\int}_{t_{0}}^{+\infty} \frac{1}{\alpha t^{\alpha - 1}} dt \right)^{\frac{1}{2}}\\ &\leq& \left( \gamma^{2} I_{\alpha}^{\frac{\alpha - 1}{\alpha}} t_{0} + I_{\alpha} \right)^{\frac{1}{2}} \left( \frac{t_{0}}{\gamma^{2} I_{\alpha}^{\frac{\alpha - 1}{\alpha}}} + \frac{1}{\alpha (\alpha - 2) t_{0}^{\alpha - 2}} \right)^{\frac{1}{2}}. \end{array} $$

Introducing t1 with \(t_{0} = I_{\alpha }^{1/\alpha } t_{1}\) leads to

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} \sqrt{\Pr[L(h, z) > t]} dt & \leq& \left( \gamma^{2} I_{\alpha} t_{1} + I_{\alpha} \right)^{\frac{1}{2}} \left( \frac{t_{1}}{\gamma^{2} I_{\alpha}^{\frac{\alpha - 2}{\alpha}}} + \frac{1}{\alpha (\alpha - 2) t_{1}^{\alpha - 2} I_{\alpha}^{\frac{\alpha - 2}{\alpha}}} \right)^{\frac{1}{2}}\\ & \leq& \left( \gamma^{2} t_{1} + 1 \right)^{\frac{1}{2}} \left( \frac{t_{1}}{\gamma^{2}} + \frac{1}{\alpha (\alpha - 2) t_{1}^{\alpha - 2}} \right)^{\frac{1}{2}} I_{\alpha}^{\frac{1}{\alpha}}. \end{array} $$

We now seek to minimize the expression \(\left (\gamma ^{2} t_{1} + 1 \right )^{\frac {1}{2}} \left (\frac {t_{1}}{\gamma ^{2}} + \frac {1}{\alpha (\alpha - 2) t_{1}^{\alpha - 2}} \right )^{\frac {1}{2}}\), first as a function of γ. This expression can be viewed as the product of the norms of the vectors \(\mathbf {u} = (\gamma t_{1}^{\frac {1}{2}}, 1)\) and \(\mathbf {v} = (\frac {t_{1}^{\frac {1}{2}}}{\gamma }, \frac {1}{\sqrt {\alpha (\alpha - 2)} t_{1}^{\frac {\alpha - 2}{2}}})\), with a constant inner product (not depending on γ). Thus, by the properties of the Cauchy-Schwarz inequality, it is minimized for collinear vectors and in that case equals their inner product:

$$\mathbf{u} \cdot \mathbf{v} = t_{1} + \frac{1}{\sqrt{\alpha (\alpha - 2)} t_{1}^{\frac{\alpha - 2}{2}}}. $$

Differentiating this last expression with respect to t1 and setting the result to zero gives the minimizing value of t1: \((\frac {2}{\alpha - 2} \sqrt {\alpha (\alpha - 2)})^{-\frac {2}{\alpha }} = \left (\frac {1}{2} \sqrt {\frac {\alpha - 2}{\alpha }} \right )^{\frac {2}{\alpha }}\). For that value of t1,

$$\mathbf{u} \cdot \mathbf{v} = \left( 1 + \frac{2}{\alpha - 2} \right) t_{1} = \frac{\alpha}{\alpha - 2} \left( \frac{1}{2} \sqrt{\frac{\alpha - 2}{\alpha}} \right)^{\frac{2}{\alpha}} = \left( \frac{1}{2} \right)^{\frac{2}{\alpha}} \left( \frac{\alpha - 2}{\alpha} \right)^{\frac{1 - \alpha}{\alpha}}, $$

which concludes the proof. □
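As a numerical illustration of the inequality just established, \({\int}_{0}^{+\infty} \sqrt{\Pr[L(h, z) > t]}\, dt \leq \left( \frac{1}{2} \right)^{\frac{2}{\alpha}} \left( \frac{\alpha - 2}{\alpha} \right)^{\frac{1 - \alpha}{\alpha}} \mathcal{L}_{\alpha}(h)^{\frac{1}{\alpha}}\), here is an added check with an Exp(1) loss (an arbitrary illustrative choice) and α = 4:

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

alpha = 4.0                          # any alpha > 2
tail = lambda t: np.exp(-t)          # Pr[L > t] for an Exp(1) loss (illustration)

lhs = integrate.quad(lambda t: np.sqrt(tail(t)), 0, np.inf)[0]   # equals 2 exactly
L_alpha = gamma(1 + alpha)           # E[L^alpha] = Gamma(alpha + 1) for Exp(1)
const = 0.5 ** (2 / alpha) * ((alpha - 2) / alpha) ** ((1 - alpha) / alpha)

print(lhs, "<=", const * L_alpha ** (1 / alpha))   # ~2.00 <= ~2.63
```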


Cite this article

Cortes, C., Greenberg, S. & Mohri, M. Relative deviation learning bounds and generalization with unbounded loss functions. Ann Math Artif Intell 85, 45–70 (2019). https://doi.org/10.1007/s10472-018-9613-y
