Abstract
A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots give deeper insight into the structure of real-world interval data.
References
Billard L, Diday E (2003) From the statistics of data to the statistics of knowledge: symbolic data analysis. J Am Stat Assoc 98(462):470–487
Bock H-H, Diday E (2000) Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer, Heidelberg
Brito P (2014) Symbolic data analysis: another look at the interaction of data mining and statistics. WIREs Data Min Knowl Discov 4(4):281–295
Brito P, Duarte Silva AP (2012) Modelling interval data with Normal and Skew-Normal distributions. J Appl Stat 39(1):3–20
Cerioli A (2010) Multivariate outlier detection with high-breakdown estimators. J Am Stat Assoc 105(489):147–156
De Carvalho FAT, Brito P, Bock H-H (2006) Dynamic clustering for interval data based on \(L_2\) distance. Comput Stat 21(2):231–250
De Carvalho FAT, Lechevallier Y (2009) Partitional clustering algorithms for symbolic interval data based on single adaptive distances. Pattern Recogn 42(7):1223–1236
Dias S, Brito P (2017) Off the beaten track: a new linear model for interval data. Eur J Oper Res 258(3):1118–1130
Diday E, Noirhomme-Fraiture M (2008) Symbolic data analysis and the SODAS software. Wiley, Chichester
Douzal-Chouakria A, Billard L, Diday E (2011) Principal component analysis for interval-valued observations. Stat Anal Data Min 4(2):229–246
Duarte Silva AP, Brito P (2017) MAINT.Data: Model and analyze interval data. R package, version 1.2.0. http://cran.r-project.org/web/packages/MAINT.Data/index.html
Duarte Silva AP, Brito P (2015) Discriminant analysis of interval data: an assessment of parametric and distance-based approaches. J Classif 32(3):516–541
Filzmoser P (2004) A multivariate outlier detection method. In: Aivazian S, Filzmoser P, Kharin Yu (eds) Proceedings of the 7th international conference on computer data analysis and modeling, vol 1. Belarusian State University, Minsk, pp 18–22
Filzmoser P, Reimann C, Garrett RG (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31:579–587
Hadi AS, Luceño A (1997) Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Comput Stat Data Anal 25(3):251–272
Hardin J, Rocke DM (2005) The distribution of robust distances. J Comput Gr Stat 14:910–927
Hubert M, Rousseeuw PJ, Van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23(1):92–119
Korkmaz S, Goksuluk D, Zararsiz G (2014) MVN: an R package for assessing multivariate normality. R J 6(2):151–162
Le-Rademacher J, Billard L (2011) Likelihood functions and some maximum likelihood estimators for symbolic data. J Stat Plan Inference 141:1593–1602
Le-Rademacher J, Billard L (2012) Symbolic covariance principal component analysis and visualization for interval-valued data. J Comput Gr Stat 21(2):413–432
Li S, Lee R, Lang S-D (2006) Detecting outliers in interval data. In: Proceedings of the 44th annual southeast regional conference. ACM, pp 290–295
Lima Neto E, De Carvalho FAT (2008) Centre and range method for fitting a linear regression model to symbolic interval data. Comput Stat Data Anal 52(3):1500–1515
Lima Neto E, De Carvalho FAT (2010) Constrained linear regression models for symbolic interval-valued variables. Comput Stat Data Anal 54(2):333–347
Lima Neto E, Cordeiro GM, De Carvalho FAT (2011) Bivariate symbolic regression models for interval-valued variables. J Stat Comput Simul 81(11):1727–1744
Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52(1):299–308
Neykov NM, Müller CH (2003) Breakdown point and computation of trimmed likelihood estimators in generalized linear models. In: Dutter R, Filzmoser P, Gather U, Rousseeuw PJ (eds) Developments in robust statistics. Physica-Verlag, Heidelberg, pp 277–286
Noirhomme-Fraiture M, Brito P (2011) Far beyond the classical data models: symbolic data analysis. Stat Anal Data Min 4(2):157–170
Pison G, Van Aelst S, Willems G (2002) Small sample corrections for LTS and MCD. Metrika 55(1–2):111–123
Ramos-Guajardo AB, Grzegorzewski P (2016) Distance-based linear discriminant analysis for interval-valued data. Inf Sci 372:591–607
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79(388):871–880
Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. Math Stat Appl 8:283–297
Rousseeuw PJ, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223
Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85(411):633–639
Van Rijsbergen CJ (1979) Information retrieval, 2nd edn. Butterworth, London
Vandev DL, Neykov NM (1998) About regression estimators with high breakdown point. Statistics 32:111–129
Viattchenin D (2012) Detecting outliers in interval-valued data using heuristic possibilistic clustering. J Comput Sci Control Syst 5(2):39–44
Acknowledgements
This work is financed by the ERDF-European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation-COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT - Fundação para a Ciência e Tecnologia (Portuguese Foundation for Science and Technology) as part of projects UID/EEA/50014/2013 and UID/GES/00731/2013.
Appendix
Determination of correction factors
The finite-sample bias-correction factors \( c^1_{m, h, n, 2p, Cf}\) and \(c_{h, n, 2p, Cf}\) used in expression (4) are obtained as follows:
First, for each combination of \(h\,=\,\{[0.5n], [0.75n], [0.875n]\}\), \(n\,=\,\{30, 50, 75, 100, 150, 200, 300, 500\}\), \(q = 2p\) with \( \textit{p}\,=\,\{1, 2, 3, 4, 5, 7, 10, 15\}\), and covariance configuration \(Cf\,=\, \{\)C1, C2, C3, C4\(\}\), we generated 1000 independent replications of standardized Gaussian data and computed the average of \(\tau \,=\, |\hat{\varSigma }|^{1/q}\), i.e., the \(2p\)-th root of the raw consistency-adjusted MCD determinant, which we denote by \(avg(\tau )\). Then, for the values of h, n and q included in these simulations, \( c^*_{h, n, q, Cf} = \frac{1}{avg(\tau )}\) are our first approximations to \( c_{h, n, q, Cf}\). In order to find approximations for the remaining parameter values, for each q, Cf and \(h\,=\,\{0.5, 0.875\}\) we fitted the models
and then for each \(Cf = \{\)C1, C2, C3, C4\(\}, \, h\,=\,\{0.5, 0.875\}, \, r \,=\, \{3,5\} \), \(q \,=\, 2p\) with \( \textit{p}\,=\, \{1, 2, 3, 4, 5, 7, 10, 15\}\) and \( n = r q^2\) we fitted
Note that \(\hat{c}^*_{h, n, q, Cf}(n) \) tends to 1 when n and/or q tend to infinity.
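For illustration, the Monte Carlo step above can be sketched as follows. Since the consistency-adjusted MCD scatter estimate requires a dedicated robust-statistics library, this sketch substitutes the ordinary sample covariance as a stand-in; the function name `approx_cstar` and the use of `numpy` are our own choices, not part of the original procedure.

```python
import numpy as np

def approx_cstar(n, q, n_rep=1000, seed=0):
    """Monte Carlo approximation of the raw correction factor c* = 1/avg(tau).

    tau = |Sigma_hat|^(1/q), the q-th root of the determinant of a scatter
    estimate computed on n standard-Gaussian observations in dimension q.
    As a stand-in for the consistency-adjusted MCD used in the paper, this
    sketch uses the ordinary sample covariance (an illustrative assumption).
    """
    rng = np.random.default_rng(seed)
    taus = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.standard_normal((n, q))
        sigma_hat = np.cov(x, rowvar=False)       # q x q scatter estimate
        taus[r] = np.linalg.det(sigma_hat) ** (1.0 / q)
    return 1.0 / taus.mean()                      # c* = 1 / avg(tau)

# For small n the determinant of the sample covariance is biased downwards,
# so the correction factor exceeds 1 and shrinks towards 1 as n grows.
print(approx_cstar(30, 4), approx_cstar(500, 4))
```

The same qualitative behaviour, a factor above 1 that approaches 1 with growing n, is what the fitted models below capture in closed form.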
The final approximation for any n and q is found by first solving the system
for \(\gamma _{h,q,Cf}\) and \(\beta _{h, q, Cf}\), and then setting \(c_{h, n, q, Cf} = 1 + \frac{\gamma _{h, q, Cf}}{n^{\beta _{h,q,Cf}}}\).
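A minimal sketch of this last step, assuming two approximations \(c^*\) at the sample sizes \(n_1 = 3q^2\) and \(n_2 = 5q^2\) are available (the numeric values below are hypothetical, not taken from the paper): taking logarithms of \(c^* - 1\) makes the two-equation system linear in \((\log \gamma , \beta )\), so it can be solved in closed form.

```python
import math

def solve_gamma_beta(n1, c1, n2, c2):
    """Solve  c_i = 1 + gamma / n_i**beta  (i = 1, 2)  for (gamma, beta).

    Dividing the two equations eliminates gamma, and taking logs yields
    beta directly; gamma then follows from the first equation."""
    beta = math.log((c1 - 1.0) / (c2 - 1.0)) / math.log(n2 / n1)
    gamma = (c1 - 1.0) * n1 ** beta
    return gamma, beta

def correction(n, gamma, beta):
    """Final correction factor  c = 1 + gamma / n**beta  for arbitrary n."""
    return 1.0 + gamma / n ** beta

# Hypothetical c* values for q = 4, so n1 = 3*q**2 = 48 and n2 = 5*q**2 = 80:
gamma, beta = solve_gamma_beta(48, 1.040, 80, 1.025)
print(correction(48, gamma, beta), correction(80, gamma, beta))
```

By construction the fitted curve reproduces both input approximations exactly, and `correction` then extrapolates to any other n.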
We did not include \(h\,=\,0.75\) or any other h in these models because, as in Pison et al. (2002), we found \(c^*_{h,n,q,Cf}\) to be roughly proportional to h so that \(c_{h,n,q,Cf}\) for different h values could be found by linear interpolation.
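The interpolation over h can be sketched in a few lines; the factor values below are hypothetical placeholders, not values from the paper.

```python
def interpolate_c(h, c_low, c_high, h_low=0.5, h_high=0.875):
    """Linearly interpolate a correction factor between the two trimming
    proportions actually covered by the simulations (h = 0.5 and 0.875)."""
    w = (h - h_low) / (h_high - h_low)
    return (1.0 - w) * c_low + w * c_high

# Hypothetical factors at h = 0.5 and h = 0.875, interpolated at h = 0.75:
print(interpolate_c(0.75, 1.050, 1.020))
```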
We note that this procedure is identical to the one described in Pison et al. (2002), except that we have one additional set of model parameters and correction factors for each covariance configuration Cf. In fact, we found that all the auxiliary models for \( c^*_{h, n, q, Cf}\) fit well, but with different parameter values for each configuration Cf, as can be seen in Table 15.
The authors in Pison et al. (2002) briefly mention that they replicated the same procedure with one-step re-weighted instead of raw MCD estimates, in order to find the \(c^1\) one-step re-weighted finite-sample bias-correction factors. We followed their steps but found that in this case the corresponding \( c^{1*}_{h,n,q,Cf}\) approximations were no longer roughly proportional to h, and could have coefficients of determination below 0.05 when regressed on h. This is not too surprising, since the re-weighted MCD uses m instead of h observations to build its final estimate. Therefore, we performed the same simulations as before, but in each replication saved the value of m, and fitted the following linear regression models (one for each configuration Cf):
where the intercepts \(\beta ^{*}_{0, n, q, Cf}\) were found by including dummy variables with all their interactions for all the n and q values used in the simulations.
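The structure of such a regression, one dummy intercept per (n, q) cell plus common slopes for m and h, can be illustrated with a small synthetic example. All numeric values and the data-generating model below are our own inventions for illustration; they are not the simulation results of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (hypothetical): per replication we record the
# correction approximation c1star, the re-weighting count m, the trimming
# size h, and the (n, q) cell it came from.
cells = [(30, 2), (50, 2), (30, 4), (50, 4)]
rows = []
for cell_id, (n, q) in enumerate(cells):
    for _ in range(50):
        h = rng.choice([15, 26])            # illustrative trimming sizes
        m = h + rng.integers(0, n - h + 1)  # re-weighted MCD keeps m >= h points
        c1star = 1.0 + 0.5 / n + 0.002 * m - 0.001 * h + rng.normal(0, 1e-3)
        rows.append((cell_id, m, h, c1star))
rows = np.array(rows)

n_cells = len(cells)
# Design matrix: one dummy intercept per (n, q) cell, plus slopes for m and h.
X = np.zeros((len(rows), n_cells + 2))
X[np.arange(len(rows)), rows[:, 0].astype(int)] = 1.0   # cell dummies
X[:, n_cells] = rows[:, 1]                               # m
X[:, n_cells + 1] = rows[:, 2]                           # h
y = rows[:, 3]

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0 = coef[:n_cells]            # intercepts beta*_{0, n, q}
beta1_m, beta2_h = coef[n_cells], coef[n_cells + 1]
print(beta1_m, beta2_h)           # slope on m positive, slope on h negative
```

The least-squares fit recovers the signs the appendix reports for the real simulations: a positive coefficient on m and a negative one on h.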
Then, we fitted the following models
ensuring that when n and q tend to infinity \(\hat{\beta }^{*}_{0, n, q, Cf}(n) + \beta ^{*}_{1, Cf} + \beta ^{*}_{2, Cf}\) and \(\hat{\beta }^{*}_{0, n, q, Cf}(q) + \beta ^{*}_{1, Cf} + \beta ^{*}_{2, Cf}\) tend to 1.
We then proceeded as before and found again that all auxiliary models fit well. The estimated values for \(\eta _{r, Cf}\), \(\kappa _{r, Cf}\), \(\beta ^{*}_{1, Cf}\) and \(\beta ^{*}_{2, Cf}\) are given in Table 16.
We note that the m coefficient, \(\beta ^{*}_{1, Cf}\), is indeed the most important one and is always positive; however, the h coefficient, \(\beta ^{*}_{2, Cf}\), which is always negative, is also highly significant. Furthermore, the values in both tables vary considerably according to the covariance configuration, in particular regarding the parameter \(\kappa \), which measures the impact of the number of variables on the bias-correction factor.
The final \(c^{1}_{m, h, n, q, Cf}\) correction factors are defined by equation
Cite this article
Duarte Silva, A.P., Filzmoser, P. & Brito, P. Outlier detection in interval data. Adv Data Anal Classif 12, 785–822 (2018). https://doi.org/10.1007/s11634-017-0305-y