research-article

Large-scale question classification in cQA by leveraging Wikipedia semantic knowledge

Authors: Li Cai, Guangyou Zhou, Kang Liu, Jun Zhao

Institute of Automation, Chinese Academy of Sciences, Beijing, China

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management, October 2011, Pages 1321–1330. https://doi.org/10.1145/2063576.2063768

Published: 24 October 2011

ABSTRACT

With the flourishing of community-based question answering (cQA) services such as Yahoo! Answers, more and more web users turn to these sites to satisfy their information needs. Understanding the information needs that users express through their questions is crucial to information providers, and question classification in cQA is studied for this purpose. However, there are two main difficulties in applying traditional methods (question classification in TREC QA and text classification) to cQA: (1) Traditional methods confine themselves to classifying a text or question into two or a few predefined categories, whereas in cQA the number of categories is much larger; Yahoo! Answers, for example, contains 1,263 categories. Our empirical results show that classification accuracy decreases dramatically as the number of categories grows to even a moderate size. (2) Unlike normal texts, questions in cQA are very short and, because of data sparseness, cannot provide sufficient word co-occurrence or shared information for a good similarity measure. In this paper, we propose a two-stage approach for question classification in cQA that tackles these difficulties. In the first stage, we perform a search process to prune the large set of categories so that the classification effort is focused on a small subset. In the second stage, we enrich questions by leveraging Wikipedia semantic knowledge to tackle the data sparseness. As a result, the classification model is trained on the enriched small subset. We demonstrate the performance of our proposed method on Yahoo! Answers with 1,263 categories. The experimental results show that our proposed method significantly outperforms the baseline method (with an error reduction of 23.21%).
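The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, since the abstract does not specify the retrieval model, the form of the Wikipedia knowledge, or the classifier: the toy `ARCHIVE`, the hand-written `CONCEPTS` dictionary (a stand-in for Wikipedia-derived concepts), word-overlap ranking for the search step, and a small Naive Bayes model swapped in as the classifier are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy archive of (question, category) pairs; hypothetical data, not from the paper.
ARCHIVE = [
    ("how do i fix a flat bike tire", "Cycling"),
    ("best road bike for beginners", "Cycling"),
    ("how to train a puppy to sit", "Dogs"),
    ("why does my dog bark at night", "Dogs"),
    ("how to install python on windows", "Programming"),
    ("what is a python list comprehension", "Programming"),
]

# Hypothetical stand-in for Wikipedia-derived knowledge: term -> related concepts.
CONCEPTS = {
    "puppy": ["dog", "pet"],
    "dog": ["pet"],
    "bike": ["bicycle", "cycling"],
    "python": ["programming", "language"],
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def enrich(tokens):
    """Stage 2 helper: append related concepts to a short question's tokens."""
    extra = [concept for t in tokens for concept in CONCEPTS.get(t, [])]
    return tokens + extra


def prune_categories(question, archive, k=3):
    """Stage 1: rank archived questions by word overlap with the query and
    keep only the categories of the top-k questions that share a word."""
    q = set(tokenize(question))
    ranked = sorted(
        archive, key=lambda qc: len(q & set(tokenize(qc[0]))), reverse=True
    )
    return {cat for text, cat in ranked[:k] if q & set(tokenize(text))}


def classify(question, archive, k=3):
    """Train add-one-smoothed Naive Bayes on enriched questions from the
    pruned categories only, then label the enriched query."""
    candidates = prune_categories(question, archive, k)
    if not candidates:
        return None
    counts = defaultdict(Counter)  # category -> word counts
    docs = Counter()               # category -> number of training questions
    for text, cat in archive:
        if cat in candidates:
            counts[cat].update(enrich(tokenize(text)))
            docs[cat] += 1
    vocab = {w for c in counts.values() for w in c}
    tokens = enrich(tokenize(question))

    def log_score(cat):
        total = sum(counts[cat].values())
        prior = math.log(docs[cat] / sum(docs.values()))
        return prior + sum(
            math.log((counts[cat][w] + 1) / (total + len(vocab))) for w in tokens
        )

    # Sort candidates first so ties are broken deterministically.
    return max(sorted(candidates), key=log_score)


if __name__ == "__main__":
    print(prune_categories("how to teach my puppy", ARCHIVE, k=2))
    print(classify("how to teach my puppy", ARCHIVE, k=2))
```

In this toy run the search step narrows three categories to two, and the added concepts ("dog", "pet") give the short query extra words in common with the Dogs training questions, which is the sparseness problem the second stage is meant to address.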

Index Terms

1. Information systems
  1. Information retrieval

Published in
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN:9781450307178
DOI:10.1145/2063576
Editors: Bettina Berendt, Arjen de Vries, Wenfei Fan, Craig Macdonald (University of Glasgow, UK), Iadh Ounis (University of Glasgow, UK), Ian Ruthven (University of Strathclyde, UK)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
large-scale classification
question retrieval
translation model
wikipedia