ABSTRACT
Labeled datasets are always limited, and the quantity of labeled data is often a bottleneck for data analytics. This especially affects supervised machine learning, which requires labeled instances for models to learn from. Active learning algorithms have been proposed to build good analytic models with limited labeling effort by determining which additional instance labels will be most beneficial to a given model. Active learning fits naturally with interactive analytics, since it proceeds in a cycle in which the unlabeled data is explored automatically. However, in standard active learning users have no control over which instances are labeled, and for text data the annotation interface typically shows only the document itself. Both constraints appear to limit the performance of an active learning model. We hypothesize that visualization techniques, particularly interactive ones, can help address these constraints. In this paper, we present a pilot study of visualization in active learning for text classification with an interactive labeling interface. We compare the results of three experiments. Early results indicate that visualization improves the building of high-performance machine learning models with an active learning algorithm.
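The active learning cycle described above (train on the labeled set, query the most informative unlabeled instance, label it, retrain) can be sketched as pool-based uncertainty sampling. This is a minimal illustrative sketch using a scikit-learn text pipeline with a toy corpus; the corpus, the logistic-regression classifier, and the uncertainty criterion are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy corpus: the first four documents are "labeled", the rest form the pool.
docs = [
    "great movie, loved it", "terrible plot, boring",
    "fantastic acting and story", "awful film, waste of time",
    "enjoyable and fun", "dull and disappointing",
    "brilliant direction", "worst movie ever",
]
labels = np.array([1, 0, 1, 0])          # sentiment labels for the first four docs
labeled_idx = list(range(4))             # indices of labeled documents
pool_idx = list(range(4, len(docs)))     # indices of the unlabeled pool

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

def query_most_uncertain(model, X, pool_idx):
    """Return the pool index whose predicted positive-class probability
    is closest to 0.5, i.e. the instance the model is least sure about."""
    probs = model.predict_proba(X[pool_idx])[:, 1]
    return pool_idx[int(np.argmin(np.abs(probs - 0.5)))]

# One step of the cycle: fit on the labeled set, then query the pool.
clf = LogisticRegression().fit(X[labeled_idx], labels)
chosen = query_most_uncertain(clf, X, pool_idx)
print("query document:", docs[chosen])
```

In a full loop the queried document would be labeled by the user, moved from `pool_idx` to `labeled_idx`, and the classifier retrained; the paper's contribution is giving the user a visual, interactive view of this pool rather than querying blindly.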