
Learning from noisy out-of-domain corpus using dataless classification

Published online by Cambridge University Press: 17 June 2020

Yiping Jin
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
Dittaya Wanvarie*
Affiliation:
Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok 10330, Thailand
Phu T. V. Le
Affiliation:
Knorex Pte. Ltd., 8 Cross St, Singapore 048424, Singapore
*Corresponding author. E-mail: Dittaya.W@chula.ac.th

Abstract

In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, so a model trained on them may not perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperformed various supervised learning and dataless classification baselines by a large margin. We evaluated different keyword selection methods both intrinsically and extrinsically by measuring their impact on dataless classification accuracy. Finally, we conducted an in-depth analysis of the classifier's behaviour and explained why the proposed dataless classification method outperformed its supervised learning counterparts.
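
To make the two-stage pipeline concrete, the following is a minimal Python sketch written under stated assumptions rather than the paper's implementation: keyword mining is approximated with a smoothed log frequency-ratio score, the dataless step is stood in for by keyword-matching pseudo-labels followed by an off-the-shelf classifier, and the toy corpora (ood_docs, in_domain_docs) are hypothetical.

import math
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def mine_keywords(docs, labels, top_k=20):
    """Stage 1: score each token by a smoothed log ratio of its per-class
    frequency to its overall frequency; keep the top_k tokens per class."""
    class_counts, total_counts = {}, Counter()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        class_counts.setdefault(label, Counter()).update(tokens)
        total_counts.update(tokens)
    keywords = {}
    for label, counts in class_counts.items():
        scores = {tok: math.log((counts[tok] + 1.0) / (total_counts[tok] + 1.0))
                  for tok in counts}
        keywords[label] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return keywords


def pseudo_label(unlabelled_docs, keywords):
    """Stand-in for Stage 2: assign each unlabelled in-domain document to the
    class whose keywords it matches most often; skip documents with no match."""
    docs, labels = [], []
    for doc in unlabelled_docs:
        tokens = set(doc.lower().split())
        hits = {label: len(tokens & set(kws)) for label, kws in keywords.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:
            docs.append(doc)
            labels.append(best)
    return docs, labels


# Hypothetical noisy out-of-domain corpus and unlabelled in-domain documents.
ood_docs = ["stocks rallied as markets closed higher",
            "the bank raised interest rates again",
            "the team won the championship game",
            "injury forces star striker out of the match"]
ood_labels = ["business", "business", "sports", "sports"]
in_domain_docs = ["markets rallied on strong quarterly earnings",
                  "the team played a great game last night"]

keywords = mine_keywords(ood_docs, ood_labels)
docs, labels = pseudo_label(in_domain_docs, keywords)
vectoriser = TfidfVectorizer()
classifier = LogisticRegression(max_iter=1000).fit(
    vectoriser.fit_transform(docs), labels)

The smoothed ratio favours tokens concentrated in one class, which is one common statistical criterion for representative keywords; the paper's actual selection methods and its dataless classifier learning from keywords and unlabelled data may differ.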

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press

