Creating Paraphrase Identification Corpus for Indian Languages: Opensource Data Set for Paraphrase Creation

Anand Kumar M., Shivkaran Singh, Praveena Ramanan, Vaithehi Sinthiya, Soman K. P.
ISBN13: 9781522596431 | ISBN10: 1522596437 | ISBN13 Softcover: 9781522596448 | EISBN13: 9781522596455
DOI: 10.4018/978-1-5225-9643-1.ch008
Cite Chapter

MLA

M., Anand Kumar, et al. "Creating Paraphrase Identification Corpus for Indian Languages: Opensource Data Set for Paraphrase Creation." Handbook of Research on Emerging Trends and Applications of Machine Learning, edited by Arun Solanki, et al., IGI Global, 2020, pp. 157-170. https://doi.org/10.4018/978-1-5225-9643-1.ch008

APA

M., A. K., Singh, S., Ramanan, P., Sinthiya, V., & K. P., S. (2020). Creating Paraphrase Identification Corpus for Indian Languages: Opensource Data Set for Paraphrase Creation. In A. Solanki, S. Kumar, & A. Nayyar (Eds.), Handbook of Research on Emerging Trends and Applications of Machine Learning (pp. 157-170). IGI Global. https://doi.org/10.4018/978-1-5225-9643-1.ch008

Chicago

M., Anand Kumar, et al. "Creating Paraphrase Identification Corpus for Indian Languages: Opensource Data Set for Paraphrase Creation." In Handbook of Research on Emerging Trends and Applications of Machine Learning, edited by Arun Solanki, Sandeep Kumar, and Anand Nayyar, 157-170. Hershey, PA: IGI Global, 2020. https://doi.org/10.4018/978-1-5225-9643-1.ch008

Abstract

In recent years, the paraphrase identification task has attracted the attention of the research community. A paraphrase is a phrase or sentence that conveys the same information as another but uses different words or a different syntactic structure. The Microsoft Research Paraphrase Corpus (MSRP) is a well-known, openly available paraphrase corpus for English. Until now, no such publicly available paraphrase corpus has existed for any Indian language. This chapter explains the creation of a paraphrase corpus for Hindi, Tamil, Malayalam, and Punjabi, the first publicly available corpus of its kind for any Indian language. The corpus was used in the shared task on detecting paraphrases for Indian languages (DPIL), held in conjunction with the Forum for Information Retrieval & Evaluation (FIRE) 2016. The annotation was performed by a postgraduate student, followed by two-step proofreading by a linguist and a language expert.
