Building a Question-Answering Corpus Using Social Media and News Articles

Cavalin, Paulo; Figueiredo, Flavio; de Bayser, Maíra; Moyano, Luis; Candello, Heloisa; Appel, Ana; Souza, Renan

doi:10.1007/978-3-319-41552-9_36

Conference paper
First Online: 21 June 2016

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Abstract

Is it possible to develop a reliable QA-Corpus using social media data? What challenges arise when attempting such a task? In this paper, we discuss these questions and present our findings from developing a QA-Corpus on the topic of Brazilian finance. To populate our corpus, we relied on opinions from experts on Brazilian finance who are active on Twitter. Through these experts, we extracted information from news websites to serve as answers in the corpus. Moreover, to rank answers to questions effectively, we employ novel word-vector-based similarity measures between short sentences (which cover both questions and tweets). We validated our methods on a recently released dataset of similarity between short Portuguese sentences. Finally, we discuss the effectiveness of our approach when used to rank answers to questions from real users.
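As an illustration of the general idea of ranking answers by word-vector similarity, the following is a minimal sketch of a common baseline: average the word vectors of each sentence and compare sentences by cosine similarity. It is not the similarity measure proposed in the paper, which the abstract does not specify. The toy embedding table and the names `TOY_VECTORS`, `sentence_vector`, `cosine`, and `rank_answers` are hypothetical and exist only for this example.

```python
import math

# Hypothetical toy embeddings, invented for this example only.
# A real system would load pretrained Portuguese word vectors instead.
TOY_VECTORS = {
    "taxa": [0.9, 0.1, 0.0],
    "juros": [0.8, 0.2, 0.1],
    "selic": [0.85, 0.15, 0.05],
    "futebol": [0.0, 0.1, 0.9],
    "jogo": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_answers(question, candidates, vectors):
    """Return (score, candidate) pairs sorted from most to least similar."""
    q = sentence_vector(question, vectors)
    scored = []
    for candidate in candidates:
        c = sentence_vector(candidate, vectors)
        # Sentences with no known words get a neutral score of 0.0.
        score = cosine(q, c) if q is not None and c is not None else 0.0
        scored.append((score, candidate))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_answers(
    "taxa de juros",
    ["jogo de futebol", "taxa selic"],
    TOY_VECTORS,
)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

With these made-up vectors, the finance-related candidate ranks above the unrelated one. Averaging ignores word order and treats all words as equally important, which is one reason more refined short-text similarity measures are of interest.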

Notes

1. Uniform Resource Locator.
2. ASSIN: Avaliação de Similaridade Semântica e Inferência Textual - http://propor2016.di.fc.ul.pt/?page_id=381.
3. Dump of 12 December 2015.

Author information

Authors and Affiliations

IBM Research, São Paulo, Brazil
Paulo Cavalin, Flavio Figueiredo, Maíra de Bayser, Luis Moyano, Heloisa Candello, Ana Appel & Renan Souza

Corresponding author

Correspondence to Paulo Cavalin.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisbon, Portugal
João Silva
ISCTE-IUL, Lisbon, Portugal
Ricardo Ribeiro
Universidade de Évora, Évora, Portugal
Paulo Quaresma
Universidade de Caxias do Sul, Caxias do Sul, Brazil
André Adami
Universidade de Lisboa, Lisbon, Portugal
António Branco

About this paper

Cite this paper

Cavalin, P. et al. (2016). Building a Question-Answering Corpus Using Social Media and News Articles. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_36

DOI: https://doi.org/10.1007/978-3-319-41552-9_36
Published: 21 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)
