
Outlier detection in interval data

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data.




References

  • Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98(462):470–487


  • Bock H-H, Diday E (2000) Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer, Heidelberg


  • Brito P (2014) Symbolic data analysis: another look at the interaction of data mining and statistics. WIREs Data Min Knowl Discov 4(4):281–295


  • Brito P, Duarte Silva AP (2012) Modelling interval data with Normal and Skew-Normal distributions. J Appl Stat 39(1):3–20


  • Cerioli A (2010) Multivariate outlier detection with high-breakdown estimators. J Am Stat Assoc 105(489):147–156


  • De Carvalho FAT, Brito P, Bock H-H (2006) Dynamic clustering for interval data based on \(L_2\) distance. Comput Stat 21(2):231–250


  • De Carvalho FAT, Lechevallier Y (2009) Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recogn 42(7):1223–1236


  • Dias S, Brito P (2017) Off the beaten track: a new linear model for interval data. Eur J Oper Res 258(3):1118–1130


  • Diday E, Noirhomme-Fraiture M (2008) Symbolic data analysis and the SODAS software. Wiley, Chichester


  • Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4(2):229–246


  • Duarte Silva AP, Brito P (2017) MAINT.Data: model and analyze interval data. R package, version 1.2.0. http://cran.r-project.org/web/packages/MAINT.Data/index.html

  • Duarte Silva AP, Brito P (2015) Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 32(3):516–541


  • Filzmoser P (2004) A multivariate outlier detection method. In: Aivazian S, Filzmoser P, Kharin Yu (eds) Proceedings of the 7th international conference on computer data analysis and modeling, vol 1. Belarusian State University, Minsk, pp 18–22

  • Filzmoser P, Reimann C, Garrett RG (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31:579–587


  • Hadi AS, Luceño A (1997) Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Comput Stat Data Anal 25(3):251–272


  • Hardin J, Rocke DM (2005) The distribution of robust distances. J Comput Graph Stat 14:910–927


  • Hubert M, Rousseeuw PJ, Van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23(1):92–119


  • Korkmaz S, Goksuluk D, Zararsiz G (2014) MVN: an R package for assessing multivariate normality. R J 6(2):151–162


  • Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Inference 141:1593–1602


  • Le-Rademacher J, Billard L (2012) Symbolic covariance principal component analysis and visualization for interval-valued data. J Comput Graph Stat 21(2):413–432


  • Li S, Lee R, Lang S-D (2006) Detecting outliers in interval data. In: Proceedings of the 44th annual southeast regional conference. ACM, pp 290–295

  • Lima Neto E, De Carvalho FAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52(3):1500–1515


  • Lima Neto E, De Carvalho FAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54(2):333–347


  • Lima Neto E, Cordeiro GM, De Carvalho FAT (2011) Bivariate symbolic regression models for interval-valued variables. J Stat Comput Simul 81(11):1727–1744


  • Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52(1):299–308


  • Neykov NM, Müller CH (2003) Breakdown point and computation of trimmed likelihood estimators in generalized linear models. In: Dutter R, Filzmoser P, Gather U, Rousseeuw PJ (eds) Developments in robust statistics. Physica-Verlag, Heidelberg, pp 277–286


  • Noirhomme-Fraiture M, Brito P (2011) Far beyond the classical data models: symbolic data analysis. Stat Anal Data Min 4(2):157–170


  • Pison G, Van Aelst S, Willems G (2002) Small sample corrections for LTS and MCD. Metrika 55(1–2):111–123


  • Ramos-Guajardo AB, Grzegorzewski P (2016) Distance-based linear discriminant analysis for interval-valued data. Inf Sci 372:591–607


  • Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880


  • Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. Math Stat Appl 8:283–297


  • Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223


  • Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85(411):633–639


  • Van Rijsbergen CJ (1979) Information retrieval, 2nd edn. Butterworth, London


  • Vandev DL, Neykov NM (1998) About regression estimators with high breakdown point. Statistics 32:111–129


  • Viattchenin D (2012) Detecting outliers in interval-valued data using heuristic possibilistic clustering. J Comput Sci Control Syst 5(2):39–44



Acknowledgements

This work is financed by the ERDF-European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation-COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e Tecnologia (Portuguese Foundation for Science and Technology) as part of projects UID/EEA/50014/2013 and UID/GES/00731/2013.

Author information


Corresponding author

Correspondence to A. Pedro Duarte Silva.

Appendices

Appendix

Determination of correction factors

The finite-sample bias-correction factors \( c^1_{m, h, n, 2p, Cf}\) and \(c_{h, n, 2p, Cf}\) used in expression (4) are obtained in the following way:

First, based on 1000 independent replications of independently generated standardized Gaussian samples, for each combination of \(h\,=\,\{[0.5n], [0.75n], [0.875n]\}\), \(n\,=\,\{30, 50, 75, 100, 150, 200, 300, 500\}\), \(q = 2p\) with \( p\,=\,\{1, 2, 3, 4, 5, 7, 10, 15\}\), and covariance configurations \(Cf\,=\, \{\)C1, C2, C3, C4\(\}\), we computed the average of \(\tau \,=\, |\hat{\varSigma }|^{1/q}\), i.e., the \(2p^{th}\) root of the raw consistency-adjusted MCD determinant, which we denote by \(avg(\tau )\). Then, for the values of h, n and q included in these simulations, \( c^*_{h, n, q, Cf} = \frac{1}{avg(\tau )}\) are our first approximations to \( c_{h, n, q, Cf}\). In order to find approximations for the remaining parameter values, for each q, Cf and \(h\,=\,\{0.5, 0.875\}\) we fitted the models

$$\begin{aligned} \hat{c}^*_{h, n, q, Cf}(n) = 1 + \frac{\gamma _{h, q, Cf}}{n^{\beta _{h, q, Cf}}} \end{aligned}$$
(8)

and then for each \(Cf = \{\)C1, C2, C3, C4\(\}\), \(h\,=\,\{0.5, 0.875\}\), \(r \,=\, \{3,5\} \), \(q \,=\, 2p\) with \( p\,=\, \{1, 2, 3, 4, 5, 7, 10, 15\}\) and \( n = r q^2\), we fitted

$$\begin{aligned} \hat{c}^*_{h, n, q, Cf}(q) = 1 + \frac{\eta _{h, r, Cf}}{q^{\kappa _{h, r, Cf}}} \end{aligned}$$
(9)

Note that both \(\hat{c}^*_{h, n, q, Cf}(n) \) and \(\hat{c}^*_{h, n, q, Cf}(q) \) tend to 1 when n and q tend to infinity.
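As an illustration, the power-law models (8) and (9) can be fitted by linearizing them and running ordinary least squares. The sketch below uses synthetic \(c^*\) values as placeholders, since the real ones come from the MCD simulations; the variable names are illustrative, not from the paper.

```python
import numpy as np

# Estimate gamma and beta in c*(n) = 1 + gamma / n**beta by linearizing:
# log(c* - 1) = log(gamma) - beta * log(n), then ordinary least squares.
# The c* values below are synthetic placeholders; in the paper they are
# the reciprocals of avg(tau) from the MCD simulations.
n_grid = np.array([30., 50., 75., 100., 150., 200., 300., 500.])
c_star = 1.0 + 0.9 / n_grid**0.8           # placeholder "observed" factors

y = np.log(c_star - 1.0)
X = np.column_stack([np.ones_like(n_grid), -np.log(n_grid)])
(log_gamma, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
gamma_hat = np.exp(log_gamma)              # recovers the generating (0.9, 0.8)
```

The same linearization applies to model (9) with \(q\) in place of \(n\); a nonlinear least-squares fit of the untransformed model is an equally valid choice.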

The final approximation for any n and q is found by first solving the system

$$\begin{aligned} \frac{\eta _{h, 3, Cf}}{q^{\kappa _{h, 3, Cf}}}&= \frac{\gamma _{h,q,Cf}}{(3q^2)^{\beta _{h, q, Cf}}} \nonumber \\ \frac{\eta _{h, 5, Cf}}{q^{\kappa _{h, 5, Cf}}}&= \frac{\gamma _{h,q,Cf}}{(5q^2)^{\beta _{h, q, Cf}}} \end{aligned}$$
(10)

for \(\gamma _{h,q,Cf}\) and \(\beta _{h, q, Cf}\), and then setting \(c_{h, n, q, Cf} = 1 + \frac{\gamma _{h, q, Cf}}{n^{\beta _{h,q,Cf}}}\).
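System (10) is log-linear in \(\log \gamma _{h,q,Cf}\) and \(\beta _{h,q,Cf}\), so it can be solved in closed form. A minimal sketch (the function names are hypothetical):

```python
import math

def solve_gamma_beta(eta3, kappa3, eta5, kappa5, q):
    """Solve system (10) for (gamma, beta) at dimension q.

    Taking logs of both equations gives a linear system in log(gamma)
    and beta; subtracting them eliminates log(gamma), leaving beta in
    closed form.
    """
    lhs3 = math.log(eta3) - kappa3 * math.log(q)   # = log(gamma) - beta*log(3*q**2)
    lhs5 = math.log(eta5) - kappa5 * math.log(q)   # = log(gamma) - beta*log(5*q**2)
    beta = (lhs3 - lhs5) / math.log(5.0 / 3.0)
    log_gamma = lhs3 + beta * math.log(3.0 * q**2)
    return math.exp(log_gamma), beta

def correction_factor(eta3, kappa3, eta5, kappa5, q, n):
    """c_{h,n,q,Cf} = 1 + gamma / n**beta with (gamma, beta) from above."""
    gamma, beta = solve_gamma_beta(eta3, kappa3, eta5, kappa5, q)
    return 1.0 + gamma / n**beta
```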

We did not include \(h\,=\,0.75\) or any other h in these models because, as in Pison et al. (2002), we found \(c^*_{h,n,q,Cf}\) to be roughly proportional to h so that \(c_{h,n,q,Cf}\) for different h values could be found by linear interpolation.
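Under that rough proportionality, a factor for an intermediate h can be obtained by interpolating between the two fitted trimming fractions. A hypothetical helper illustrating the idea:

```python
def interpolate_c(c_half, c_0875, h_frac):
    """Linearly interpolate the correction factor between the values
    fitted at h/n = 0.5 and h/n = 0.875 (a hypothetical helper that
    exploits the rough proportionality in h noted above)."""
    w = (h_frac - 0.5) / (0.875 - 0.5)
    return (1.0 - w) * c_half + w * c_0875
```

For example, `interpolate_c(c_half, c_0875, 0.75)` weights the two fitted factors one third and two thirds, respectively.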

We note that this procedure is identical to the one described in Pison et al. (2002), with the only exception that we have one additional set of model parameters and correction factors for each covariance configuration Cf. In fact, we found all the auxiliary models for \( c^*_{h, n, q, Cf}\) to fit well, but with different parameter values for each configuration Cf, as can be seen in Table 15.

Table 15 Auxiliary model parameters-raw MCD

The authors in Pison et al. (2002) briefly mention that they replicated the same procedure with one-step re-weighted instead of raw MCD estimates, in order to find the \(c^1\) one-step re-weighted finite-sample bias-correction factors. We followed their steps but found that in this case the corresponding \( c^{1*}_{h,n,q,Cf}\) approximations were no longer roughly proportional in h, and could have coefficients of determination below 0.05 when regressed on h. This is not surprising, since the re-weighted MCD uses m instead of h observations to build its final estimate. Therefore, we performed the same simulations as before, but in each replication saved the value of m, and fitted the following linear regression models (one for each configuration Cf):

$$\begin{aligned} \tau = \beta ^{*}_{0, n, q, Cf} + \beta ^{*}_{1,Cf} \, \frac{m}{n} + \beta ^{*}_{2,Cf} \, \frac{h}{n} \end{aligned}$$
(11)

where the intercepts \(\beta ^{*}_{0, n, q, Cf}\) were estimated by including dummy variables, with all their interactions, for all the n and q values used in the simulations.
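The structure of this regression can be sketched as follows. All data below are synthetic placeholders (in the paper, \(\tau\) and m come from the re-weighted MCD simulation replicates), and the cell grid is illustrative:

```python
import numpy as np

# Sketch of model (11): tau regressed on m/n and h/n, with a separate
# intercept for every (n, q) cell encoded as one-hot dummy columns.
rng = np.random.default_rng(0)
cells = [(50, 2), (50, 4), (100, 2), (100, 4)]    # illustrative (n, q) pairs
rows, taus = [], []
for i, (n, q) in enumerate(cells):
    for _ in range(250):                           # replications per cell
        h = int(rng.choice([0.5, 0.75, 0.875]) * n)
        m = int(rng.integers(h, n + 1))            # observations kept by re-weighting
        onehot = np.eye(len(cells))[i]
        rows.append(np.concatenate([onehot, [m / n, h / n]]))
        taus.append(1.0 + 0.1 * i + 0.5 * m / n - 0.3 * h / n
                    + rng.normal(scale=0.01))      # placeholder intercepts/slopes
X, y = np.array(rows), np.array(taus)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b1, b2 = coef[-2], coef[-1]                        # m/n and h/n slopes
```

Least squares recovers the per-cell intercepts together with the common m/n and h/n slopes, which is exactly the role of the dummy variables in model (11).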

Then we fitted the following models

$$\begin{aligned} \hat{\beta }^{*}_{0, n, q, Cf}(n)&= 1 - \beta ^{*}_{1, Cf} - \beta ^{*}_{2, Cf} + \frac{\gamma _{q, Cf}}{n^{\beta _{q, Cf}}} \end{aligned}$$
(12)
$$\begin{aligned} \hat{\beta }^{*}_{0, n, q, Cf}(q)&= 1 - \beta ^{*}_{1, Cf} - \beta ^{*}_{2, Cf} + \frac{\eta _{r, Cf}}{q^{\kappa _{r, Cf}}} \end{aligned}$$
(13)

ensuring that when n and q tend to infinity \(\hat{\beta }^{*}_{0, n, q, Cf}(n) + \beta ^{*}_{1, Cf} + \beta ^{*}_{2, Cf}\) and \(\hat{\beta }^{*}_{0, n, q, Cf}(q) + \beta ^{*}_{1, Cf} + \beta ^{*}_{2, Cf}\) tend to 1.

We then proceeded as before and found again that all auxiliary models fitted the data well. The estimated values of \(\eta _{r, Cf}\), \(\kappa _{r, Cf}\), \(\beta ^{*}_{1, Cf}\) and \(\beta ^{*}_{2, Cf}\) are given in Table 16.

Table 16 Auxiliary model parameters–re-weighted MCD

We note that the m coefficient, \(\beta ^{*}_{1, Cf}\), is indeed the most important one and is always positive; however, the h coefficient, \(\beta ^{*}_{2, Cf}\), which is always negative, is also highly significant. Furthermore, the values in both tables vary considerably according to the covariance configuration, in particular regarding the parameter \(\kappa \), which measures the impact of the number of variables on the bias-correction factor.

The final \(c^{1}_{m, h, n, q, Cf}\) correction factors are defined by equation

$$\begin{aligned} c^{1}_{m, h, n, q, Cf} \ = \ \frac{1}{\hat{\tau }} \ = \ \frac{1}{\hat{\beta }^{*}_{0, n, q, Cf}(n) + \beta ^{*}_{1, Cf} \, \frac{m}{n} + \beta ^{*}_{2, Cf} \, \frac{h}{n}} \end{aligned}$$
(14)
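Equation (14) translates directly into a small helper; the function name and the parameter values in the usage note are illustrative, not the tabulated ones:

```python
def c1_factor(beta0_hat, beta1, beta2, m, h, n):
    """Final re-weighted correction factor of equation (14): the
    reciprocal of the predicted tau for a sample of size n with m
    re-weighted observations, at trimming size h."""
    tau_hat = beta0_hat + beta1 * (m / n) + beta2 * (h / n)
    return 1.0 / tau_hat
```

For instance, with illustrative parameters `c1_factor(0.6, 0.5, -0.3, 90, 75, 100)` predicts \(\hat{\tau } = 0.825\) and returns its reciprocal as the correction factor.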


About this article


Cite this article

Duarte Silva, A.P., Filzmoser, P. & Brito, P. Outlier detection in interval data. Adv Data Anal Classif 12, 785–822 (2018). https://doi.org/10.1007/s11634-017-0305-y


