Computing LTS Regression for Large Data Sets

ROUSSEEUW, PETER J.; VAN DRIESSEN, KATRIEN

doi:10.1007/s10618-005-0024-4

Published: 03 February 2006

Volume 12, pages 29–45, (2006)

Data Mining and Knowledge Discovery

PETER J. ROUSSEEUW¹ &
KATRIEN VAN DRIESSEN²

Abstract

Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for; see, e.g., Rousseeuw and Leroy (1987). Here we focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too quickly with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
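The LTS criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' FAST-LTS implementation: it omits selective iteration, nested extensions and the intercept adjustment, and it assumes that the inequality mentioned above is used as a "concentration step" (fit least squares on a subset, then keep the h cases with the smallest squared residuals), which cannot increase the trimmed sum of squares. The function names, the number of random starts and the stopping tolerance are all hypothetical choices made for illustration.

```python
import numpy as np


def lts_objective(X, y, beta, h):
    """Sum of the h smallest squared residuals of the fit `beta`."""
    r2 = (y - X @ beta) ** 2
    return np.sort(r2)[:h].sum()


def concentration_step(X, y, subset, h):
    """Least squares fit on `subset`, then the h cases with smallest squared residuals.

    Assumed reading of the abstract's inequality: the trimmed sum of squares
    of the refitted subset is never larger than that of the previous one.
    """
    beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    r2 = (y - X @ beta) ** 2
    return beta, np.argsort(r2)[:h]


def lts_sketch(X, y, h, n_starts=50, max_iter=100, seed=0):
    """Approximate LTS fit from random p-point starts (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)  # random initial subset
        prev = np.inf
        for _ in range(max_iter):
            beta, subset = concentration_step(X, y, subset, h)
            obj = lts_objective(X, y, beta, h)
            if prev - obj < 1e-12:  # no further decrease: stop iterating
                break
            prev = obj
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```

Here `X` is assumed to be an n-by-p design matrix that already contains a column of ones if an intercept is wanted, and `h` is the coverage, chosen between n/2 and n. Taking the best result over many random starts is a heuristic, so the returned fit is only an approximation to the exact LTS estimate.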

References

Agulló, J. 1997a. Computación de estimadores con alto punto de ruptura. Ph.D. Thesis, University of Alicante, Spain.
Agulló, J. 1997b. Exact algorithms to compute the least median of squares estimate in multiple linear regression. In L₁-Statistical Procedures and Related Topics, Y. Dodge (ed.), The IMS Lecture Notes – Monograph Series, Volume 31, pp. 133–146.
Chork, C.J. 1990. Unmasking multivariate anomalous observations in exploration geochemical data from sheeted-vein tin mineralization near Emmaville, N.S.W., Australia. Journal of Geochemical Exploration, 37:191–203.
Coakley, C.W. and Hettmansperger, T.P. 1993. A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88:872–880.
Hawkins, D.M. 1994. The feasible solution algorithm for least trimmed squares regression. Computational Statistics and Data Analysis, 17:185–196.
Hawkins, D.M. and Olive, D.J. 1999. Improved feasible solution algorithms for high breakdown estimation. Computational Statistics and Data Analysis, 30:1–11.
Hössjer, O. 1994. Rank-based estimates in the linear model with high breakdown point. Journal of the American Statistical Association, 89:149–158.
Huang, Z. 1998. Extensions of the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304.
Kaufman, L. and Rousseeuw, P.J. 1986. Clustering large data sets. In Pattern Recognition in Practice II, E.S. Gelsema and L.N. Kanal (eds.) Elsevier/North-Holland, pp. 425–437.
Kaufman, L. and Rousseeuw, P.J. 1990. Finding Groups in Data, New York: John Wiley.
Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. 1991. Robust regression methods in computer vision: a review. International Journal of Computer Vision, 6:59–70.
Mili, L., Phaniraj, V., and Rousseeuw, P.J. 1991. Least median of squares estimation in power systems (with discussion). IEEE Trans. on Power Systems, 6:511–523.
Mili, L., Cheniae, N.S., and Rousseeuw, P.J. 1996. Robust state estimation based on projection statistics (with discussion). IEEE Trans. on Power Systems, 11:1118–1127.
Ng, R.T. and Han, J. 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (VLDB ’94), Santiago, Chile, September 1994, pp. 144–155.
Odewahn, S.C., Djorgovski, S.G., Brunner, R.J., and Gal, R. 1998. Data From the Digitized Palomar Sky Survey. Technical Report, California Institute of Technology.
Rousseeuw, P.J. 1984. Least median of squares regression. Journal of the American Statistical Association, 79:871–880.
Rousseeuw, P.J. 1985. Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, Vol B, W. Grossmann, G. Pflug, I. Vincze and W. Wertz (eds.) Dordrecht: Reidel, pp. 283–297.
Rousseeuw, P.J. 1997. Introduction to positive-breakdown methods. In Handbook of Statistics, Vol. 15: Robust Inference, G.S. Maddala and C.R. Rao (eds.) Amsterdam: Elsevier, pp. 101–121.
Rousseeuw, P.J. and Hubert, M. 1997. Recent developments in PROGRESS. In L₁-Statistical Procedures and Related Topics, Y. Dodge (ed.), The IMS Lecture Notes – Monograph Series, Volume 31, pp. 201–214.
Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection, New York: John Wiley.
Rousseeuw, P.J. and Van Driessen, K. 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212–223.
Rousseeuw, P.J. and van Zomeren, B.C. 1990. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85:633–639.
Steele, J.M. and Steiger, W.L. 1986. Algorithms and complexity for least median of squares regression. Discrete Applied Mathematics, 14:93–100.
Stromberg, A.J. 1993. Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal on Scientific Computing, 14:1289–1299.
Simpson, D.G., Ruppert, D., and Carroll, R.J. 1992. On one-step GM-estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87:439–450.
Wang, C.M., Vecchia, D.F., Young, M. and Brilliant, N.A. 1997. Robust regression applied to optical fiber dimensional quality control. Technometrics, 39:25–33.
Woodruff, D.L. and Rocke, D.M. 1994. Computable robust estimation of multivariate location and shape in high dimension using compound estimators. Journal of the American Statistical Association, 89:888–896.
Yohai, V.J. 1987. High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15:642–656.
Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182.

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Universiteit Antwerpen, Middelheimlaan 1, B-2020, Antwerpen, Belgium
PETER J. ROUSSEEUW
Faculty of Applied Economics, Universiteit Antwerpen, Prinsstraat 13, B-2000, Antwerpen, Belgium
KATRIEN VAN DRIESSEN

Corresponding author

Correspondence to PETER J. ROUSSEEUW.

About this article

Cite this article

ROUSSEEUW, P.J., VAN DRIESSEN, K. Computing LTS Regression for Large Data Sets. Data Min Knowl Disc 12, 29–45 (2006). https://doi.org/10.1007/s10618-005-0024-4

Accepted: 03 June 2005
Published: 03 February 2006
Issue Date: January 2006
DOI: https://doi.org/10.1007/s10618-005-0024-4
