Robustness Issues in Text Mining

Turchi, Marco; Perrotta, Domenico; Riani, Marco; Cerioli, Andrea

doi:10.1007/978-3-642-33042-1_29

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 190)

Abstract

We extend the Forward Search approach for robust data analysis to address problems in text mining. In this domain, datasets are collections of an arbitrary number of documents, which are represented as vectors of thousands of elements according to the vector space model. When the number of variables v is this large and the dataset size n is smaller by orders of magnitude, the traditional Mahalanobis metric cannot be used as a measure of distance between documents. We show that by monitoring the cosine (dis)similarity measure with the Forward Search approach it is possible to perform robust estimation for a document collection and to order the documents so that the most dissimilar ones (possibly outliers for that collection) are left at the end. We also show that the presence of multiple groups of documents in the collection is clearly detected with multiple starts of the Forward Search.
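The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine_dissimilarity`, `centroid`, `forward_search`), the use of the subset mean as the fitted "centre", the hand-picked starting subset, and the toy term-weight vectors are all assumptions made for illustration. Unlike the Mahalanobis distance, the cosine dissimilarity does not require inverting a covariance matrix, which is singular whenever v exceeds n. The loop grows the subset by one unit per step, records the smallest dissimilarity among the units still outside it, and returns the order in which units enter.

```python
import math

def cosine_dissimilarity(a, b):
    """Return 1 - cos(a, b); defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def forward_search(docs, start):
    """Illustrative forward search driven by cosine dissimilarity.

    docs  : list of equal-length term-weight vectors (hypothetical data)
    start : indices of the initial subset (chosen by hand in this sketch)
    Returns (entry_order, monitored), where monitored[k] is the minimum
    dissimilarity among the units outside the subset at step k.
    """
    subset = sorted(set(start))
    entry_order = list(subset)
    monitored = []
    while len(subset) < len(docs):
        centre = centroid([docs[i] for i in subset])
        d = [cosine_dissimilarity(doc, centre) for doc in docs]
        outside = [i for i in range(len(docs)) if i not in subset]
        monitored.append(min(d[i] for i in outside))
        # The next subset holds the m + 1 units closest to the current centre;
        # units are allowed to leave and re-enter between steps.
        ranked = sorted(range(len(docs)), key=lambda i: (d[i], i))
        subset = sorted(ranked[:len(subset) + 1])
        for i in subset:
            if i not in entry_order:
                entry_order.append(i)
    return entry_order, monitored

# Toy collection: four documents share terms 0 and 1; document 2 uses
# only terms 2 and 3 and plays the role of an outlier.
docs = [
    [3, 1, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 2, 3],
    [3, 2, 0, 0],
    [1, 1, 0, 0],
]
order, monitored = forward_search(docs, start=[0, 1])
print(order)      # the outlying document (index 2) enters last
print(monitored)  # the final value jumps to 1.0 when only the outlier is left
```

In this toy run the monitored dissimilarity stays small while similar documents enter and jumps when only the outlying document remains, which is the kind of signal a forward plot is meant to expose. Running the search from several different starting subsets and comparing the monitored trajectories is one way to look for multiple groups, as the abstract describes.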

Author information

Authors and Affiliations

European Commission, Joint Research Centre, Brussels, Belgium
Marco Turchi & Domenico Perrotta
Department of Economics, University of Parma, Parma, Italy
Marco Riani & Andrea Cerioli

Corresponding author

Correspondence to Marco Turchi.

Editor information

Editors and Affiliations

Faculty of Computer Science, Otto-von-Guericke University of Magdeburg, Geb. 29, Raum 008, Universitätsplatz 2, Magdeburg, 39106, Germany
Rudolf Kruse
FB Informatik & Informationswissenschaft, University of Konstanz, Konstanz, 78457, Germany
Michael R. Berthold
Faculty of Computer Science, Otto-von-Guericke University of Magdeburg, Geb. 29, Universitätsplatz 2 008, Magdeburg, 39106, Germany
Christian Moewes
Department of Statistics and OR, University of Oviedo, C/ Calvo Sotelo, s/n, Oviedo, 33007, Spain
María Ángeles Gil
Systems Research Institute, Polish Academy of Sciences, Newelska 6, Warsaw, 01-447, Poland
Przemysław Grzegorzewski
Systems Research Institute, Polish Academy of Sciences, Newelska 6, Warsaw, 01-447, Poland
Olgierd Hryniewicz

About this paper

Cite this paper

Turchi, M., Perrotta, D., Riani, M., Cerioli, A. (2013). Robustness Issues in Text Mining. In: Kruse, R., Berthold, M., Moewes, C., Gil, M., Grzegorzewski, P., Hryniewicz, O. (eds) Synergies of Soft Computing and Statistics for Intelligent Data Analysis. Advances in Intelligent Systems and Computing, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33042-1_29

DOI: https://doi.org/10.1007/978-3-642-33042-1_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33041-4
Online ISBN: 978-3-642-33042-1
eBook Packages: Engineering, Engineering (R0)