DOI: 10.1145/2063576.2063768
research-article

Large-scale question classification in cQA by leveraging Wikipedia semantic knowledge

Published: 24 October 2011

ABSTRACT

With the flourishing of community-based question answering (cQA) services such as Yahoo! Answers, more and more web users turn to these sites to satisfy their information needs. Understanding the information needs that users express through their questions is crucial to information providers, and question classification in cQA is studied for this purpose. However, there are two main difficulties in applying traditional methods (question classification in TREC QA and text classification) to cQA: (1) Traditional methods confine themselves to classifying a text or question into two or a few predefined categories, whereas in cQA the number of categories is much larger; Yahoo! Answers, for example, contains 1,263 categories. Our empirical results show that classification accuracy decreases dramatically once the number of categories grows to a moderate size. (2) Unlike normal texts, questions in cQA are very short and, because of data sparseness, cannot provide sufficient word co-occurrence or shared information for a good similarity measure. In this paper, we propose a two-stage approach for question classification in cQA that tackles these difficulties. In the first stage, we perform a search process to prune the large set of categories so that the classification effort focuses on a small subset. In the second stage, we enrich questions by leveraging Wikipedia semantic knowledge to tackle the data sparseness. As a result, the classification model is trained on the enriched small subset. We demonstrate the performance of our proposed method on Yahoo! Answers with 1,263 categories. The experimental results show that our proposed method significantly outperforms the baseline method (with an error reduction of 23.21%).
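The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the training questions, the `RELATED` dictionary (a hypothetical stand-in for Wikipedia-derived concepts), the function names, and the cosine-over-centroids scoring (swapped in for the trained classification model) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labelled questions (hypothetical data, not from the paper).
TRAIN = [
    ("how do i fix a flat bike tire", "Cycling"),
    ("best road bike for beginners", "Cycling"),
    ("how to train a puppy to sit", "Dogs"),
    ("what food is safe for my dog", "Dogs"),
    ("how do i install python on windows", "Programming"),
    ("why does my java code not compile", "Programming"),
]

# Hypothetical stand-in for concepts related through Wikipedia knowledge.
RELATED = {
    "puppy": ["dog"],
    "bike": ["bicycle", "cycling"],
    "python": ["programming", "code"],
    "java": ["programming", "code"],
}


def tokens(text):
    """Lowercase whitespace tokenization."""
    return text.lower().split()


def enrich(text):
    """Add related concepts to a short question to reduce sparseness."""
    toks = tokens(text)
    extra = [concept for t in toks for concept in RELATED.get(t, [])]
    return Counter(toks + extra)


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def prune_categories(question, k=2):
    """Stage 1: keep only the categories of the k most similar training questions."""
    q = Counter(tokens(question))
    ranked = sorted(
        TRAIN,
        key=lambda ex: cosine(q, Counter(tokens(ex[0]))),
        reverse=True,
    )
    return {category for _, category in ranked[:k]}


def classify(question, k=2):
    """Stage 2: score the enriched question against the pruned categories only."""
    candidates = prune_categories(question, k)
    q = enrich(question)
    centroids = defaultdict(Counter)
    for text, category in TRAIN:
        if category in candidates:
            centroids[category].update(enrich(text))
    # sorted() makes tie-breaking between equal scores deterministic.
    return max(sorted(candidates), key=lambda c: cosine(q, centroids[c]))


if __name__ == "__main__":
    print(classify("is chocolate bad for my puppy"))
```

In this sketch the search step works on the raw question text and the enrichment is applied only to the surviving candidate categories, mirroring the order of the two stages described above. A real system would replace the toy dictionary with concepts mined from Wikipedia and the centroid scoring with a trained classifier.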

    • Published in

      CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
      October 2011
      2712 pages
      ISBN:9781450307178
      DOI:10.1145/2063576

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
