
Regularization Networks and Support Vector Machines

Abstract

Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples – in particular, the regression problem of approximating a multivariate function from sparse data. Radial Basis Functions, for example, are a special case of both regularization and Support Vector Machines. We review both formulations in the context of Vapnik's theory of statistical learning, which provides a general foundation for the learning problem, combining functional analysis and statistics. The emphasis is on regression: classification is treated as a special case.
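
To make the unification concrete: both families of techniques can be viewed as minimizers of a Tikhonov-type regularized functional over a Reproducing Kernel Hilbert Space H with kernel K, where only the loss function V changes (this is the standard formulation in this literature; the notation below is chosen for illustration):

\min_{f \in \mathcal{H}} \; \frac{1}{\ell} \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr) \;+\; \lambda \, \|f\|_K^2

With the squared loss V(y, f(x)) = (y - f(x))^2 one obtains Regularization Networks; with Vapnik's epsilon-insensitive loss V(y, f(x)) = max(|y - f(x)| - \varepsilon, 0) one obtains Support Vector Machine regression. In either case the representer theorem guarantees a minimizer of the form f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i), i.e., a network with one kernel unit per data point; a radial kernel such as the Gaussian recovers Radial Basis Functions.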

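For the squared-loss case the coefficients c solve a single linear system, (K + \lambda \ell I) c = y, so a Regularization Network fits in a few lines of code. The sketch below is illustrative only and not taken from the paper; the Gaussian kernel width sigma and the regularization parameter lam are hypothetical choices.

import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Pairwise Gaussian RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularization_network(X, y, lam=1e-2, sigma=1.0):
    # Representer theorem: the minimizer is f(x) = sum_i c_i K(x, x_i);
    # for squared loss the coefficients solve (K + lam * ell * I) c = y.
    ell = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * ell * np.eye(ell), y)

def predict(X_train, c, X_new, sigma=1.0):
    # Evaluate f at new points: one kernel unit per training example.
    return gaussian_kernel(X_new, X_train, sigma) @ c

# Usage: recover a smooth 1-D function from sparse, noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
c = fit_regularization_network(X, y, lam=1e-3, sigma=0.8)
print(predict(X, c, np.array([[0.0]]), sigma=0.8))  # approx sin(0) = 0

Replacing the linear solve with the epsilon-insensitive loss turns the same architecture into SVM regression, at the cost of a quadratic programming problem instead of a linear system.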

Cite this article

Evgeniou, T., Pontil, M. & Poggio, T. Regularization Networks and Support Vector Machines. Advances in Computational Mathematics 13, 1–50 (2000). https://doi.org/10.1023/A:1018946025316
