Machine Learning Identification of Self-reported COVID-19 Symptoms from Tweets in Canada

Gilbert, Jean-Philippe; Niu, Jingcheng; de Montigny, Simon; Ng, Victoria; Rees, Erin

doi:10.1007/978-3-030-93080-6_9

Part of the book series: Studies in Computational Intelligence (SCI, volume 1013)

Included in the following conference series:

International Workshop on Health Intelligence

Abstract

Surveillance of open-source media, such as social media, has become an essential complement to traditional surveillance data for quickly detecting changes in the occurrence of diseases in time and space. We present a method for classifying tweets into narratives about COVID-19 symptoms, producing a dataset for downstream surveillance applications. A dataset of 10,405 tweets was manually classified as relevant or not relevant to self-reported symptoms of COVID-19. Five machine learning classification algorithms, each paired with different tokenization methods, were trained and tested on this dataset. The support vector machine (SVM) algorithm, using term frequency-inverse document frequency (TF-IDF) weighting over character 3- and 4-grams as the tokenization method, achieved the highest F1-score, 0.70. However, the training dataset was imbalanced between the two classes. To reduce the bias caused by the imbalanced classes, the crowdsourcing website Mechanical Turk was used to add 133 relevant tweets. This addition improved the F1-score from 0.70 to 0.77.
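
To make the feature representation and the evaluation metric in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It extracts character 3- and 4-grams, weights them with one common smoothed TF-IDF variant, and computes an F1-score. The function names, the lowercasing step, the IDF formula, and the toy labels are all assumptions made for illustration; the SVM classifier itself is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=4):
    """Return all character n-grams of length n_min..n_max (text lowercased)."""
    text = text.lower()  # assumption: case folding, not stated in the abstract
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, n_min=3, n_max=4):
    """Map each document to a dict {ngram: TF-IDF weight}, L2-normalised."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF variant: log((1 + N) / (1 + df)) + 1.
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def f1_score(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy texts and labels, not taken from the study's dataset.
    docs = ["I have a dry cough and fever", "New COVID-19 case counts released"]
    vectors = tfidf_vectors(docs)
    print(len(vectors[0]), "distinct n-grams in the first text")
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Character n-grams are often chosen for tweets because they tolerate misspellings and informal abbreviations better than word tokens. The F1-score is a natural metric here because, with few relevant tweets, plain accuracy would reward a classifier that labels everything as not relevant.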

Notes

1. However, given that English and French are the two official languages most widely used by various communities on social media in Canada, our later analysis only used tweets of those two languages.
2. http://www.geonames.org

Author information

Authors and Affiliations

Laval University, Quebec, Canada
Jean-Philippe Gilbert
University of Toronto, Toronto, Canada
Jingcheng Niu
School of Public Health, University of Montreal, Montreal, Canada
Simon de Montigny
Public Health Agency of Canada, Ottawa, Canada
Victoria Ng & Erin Rees

Corresponding author

Correspondence to Erin Rees.

Editor information

Editors and Affiliations

Oak-Ridge National Lab (ORNL), Department of Pediatrics, Center for Biomedical Informatics, College of Medicine, The University of Tennessee Health Science Center (UTHSC), Memphis, TN, USA
Arash Shaban-Nejad
School of Nursing, University of Minnesota, Minneapolis, MN, USA
Martin Michalowski
IBM Almaden Research Center, San Jose, CA, USA
Simone Bianco

About this chapter

Cite this chapter

Gilbert, JP., Niu, J., de Montigny, S., Ng, V., Rees, E. (2022). Machine Learning Identification of Self-reported COVID-19 Symptoms from Tweets in Canada. In: Shaban-Nejad, A., Michalowski, M., Bianco, S. (eds) AI for Disease Surveillance and Pandemic Intelligence. W3PHAI 2021. Studies in Computational Intelligence, vol 1013. Springer, Cham. https://doi.org/10.1007/978-3-030-93080-6_9

DOI: https://doi.org/10.1007/978-3-030-93080-6_9
Published: 09 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93079-0
Online ISBN: 978-3-030-93080-6
eBook Packages: Intelligent Technologies and Robotics (R0)
