ABSTRACT
With the flourishing of community-based question answering (cQA) services such as Yahoo! Answers, more and more web users turn to these sites to satisfy their information needs. Understanding the information needs that users express through their questions is crucial to information providers, and question classification in cQA is studied for this purpose. However, there are two main difficulties in applying traditional methods (question classification in TREC QA and text classification) to cQA: (1) Traditional methods confine themselves to classifying a text or question into two or a few predefined categories, whereas in cQA the number of categories is much larger; Yahoo! Answers, for example, contains 1,263 categories. Our empirical results show that as the number of categories increases to a moderate size, classification accuracy decreases dramatically. (2) Unlike normal texts, questions in cQA are very short, and because of data sparseness they cannot provide sufficient word co-occurrence or shared information for a good similarity measure. In this paper, we propose a two-stage approach to question classification in cQA that tackles these difficulties. In the first stage, we perform a search process to prune the large set of categories so that the classification effort focuses on a small subset. In the second stage, we enrich questions by leveraging Wikipedia semantic knowledge to tackle the data sparseness. As a result, the classification model is trained on the enriched small subset. We demonstrate the performance of our proposed method on Yahoo! Answers with 1,263 categories. The experimental results show that our proposed method significantly outperforms the baseline method (with an error reduction of 23.21%).
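The two-stage idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the toy `ARCHIVE` of labelled questions, the hand-written `CONCEPTS` dictionary standing in for Wikipedia semantic knowledge, bag-of-words cosine similarity as the search step, and a nearest-centroid rule swapped in for the trained classification model.

```python
import math
from collections import Counter, defaultdict

# Toy labelled archive of (question, category) pairs; illustrative data only.
ARCHIVE = [
    ("how do i fix a flat bike tire", "Cycling"),
    ("best gears for climbing hills on a road bike", "Cycling"),
    ("how to train for a marathon", "Running"),
    ("what shoes are good for long distance running", "Running"),
    ("how do i install python on windows", "Programming"),
    ("why does my java program throw a null pointer exception", "Programming"),
    ("how long to boil an egg", "Cooking"),
    ("what is a good recipe for pasta sauce", "Cooking"),
]

# Hypothetical stand-in for Wikipedia knowledge: term -> related concepts.
CONCEPTS = {
    "python": ["programming_language", "software"],
    "java": ["programming_language", "software"],
    "bike": ["bicycle", "cycling"],
    "marathon": ["running", "endurance_sport"],
    "pasta": ["food", "italian_cuisine"],
    "egg": ["food"],
}


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def enrich(tokens):
    """Append related concepts so that short questions share more features."""
    extra = [concept for tok in tokens for concept in CONCEPTS.get(tok, [])]
    return tokens + extra


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_categories(question, k=2):
    """Stage 1: search the archive and keep the k best-matching categories."""
    query = Counter(tokenize(question))
    best = defaultdict(float)
    for text, category in ARCHIVE:
        score = cosine(query, Counter(tokenize(text)))
        best[category] = max(best[category], score)
    ranked = sorted(best, key=lambda c: best[c], reverse=True)
    return ranked[:k]


def classify(question, k=2):
    """Stage 2: enrich the question, then decide among the pruned categories."""
    candidates = prune_categories(question, k)
    query = Counter(enrich(tokenize(question)))
    # Build one enriched centroid per surviving category only.
    centroids = defaultdict(Counter)
    for text, category in ARCHIVE:
        if category in candidates:
            centroids[category].update(enrich(tokenize(text)))
    return max(candidates, key=lambda c: cosine(query, centroids[c]))


if __name__ == "__main__":
    print(prune_categories("python exception on windows"))
    print(classify("python exception on windows"))
```

The point of the design is that the second stage compares the question against only the `k` categories that survive the search step, so a classifier never has to separate all categories at once, and the enrichment step gives short questions extra shared features before that comparison.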
Index Terms
- Large-scale question classification in cQA by leveraging Wikipedia semantic knowledge