Machine Learning Identification of Self-reported COVID-19 Symptoms from Tweets in Canada

AI for Disease Surveillance and Pandemic Intelligence (W3PHAI 2021)

Part of the book series: Studies in Computational Intelligence (SCI, volume 1013)

Abstract

Surveillance of open-source media, such as social media, has become an essential complement to traditional surveillance data for quickly detecting changes in the occurrence of diseases in time and space. We present our method for classifying tweets as narratives about COVID-19 symptoms to produce a dataset for downstream surveillance applications. A dataset of 10,405 tweets was manually classified as relevant or not relevant to self-reported symptoms of COVID-19. Five machine learning classification algorithms, with different tokenization methods, were trained and tested on the dataset. The support vector machine (SVM) algorithm, with term frequency-inverse document frequency (TF-IDF) weighting of character 3- and 4-grams as the tokenization method, achieved the highest F1-score, 0.70. However, the training dataset was imbalanced between classes. To reduce the bias caused by the imbalanced classes, the crowdsourcing website Mechanical Turk was used to add 133 relevant tweets. This addition improved the F1-score from 0.70 to 0.77.
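To make two terms from the abstract concrete, the following minimal Python sketch shows what character 3- and 4-gram tokenization produces for a short string and how an F1-score is computed from binary labels. It is an illustration only, not the authors' implementation: the function names `char_ngrams` and `f1_score`, the lowercasing step, and the toy labels are assumptions introduced here, and the TF-IDF weighting and SVM training steps are omitted.

```python
def char_ngrams(text, n_min=3, n_max=4):
    """Return every character n-gram of length n_min..n_max.

    Lowercasing is an assumption of this sketch, not a documented
    preprocessing step of the chapter.
    """
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy input; 1 = relevant to self-reported symptoms.
    print(char_ngrams("cough"))
    print(f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Because the F1-score ignores true negatives, it is a more informative summary than accuracy when the relevant class is rare, which is the situation the abstract describes for this dataset.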

Notes

  1. However, given that English and French are the two official languages most widely used by various communities on social media in Canada, our later analysis only used tweets of those two languages.

  2. http://www.geonames.org.

Author information

Correspondence to Erin Rees.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Gilbert, JP., Niu, J., de Montigny, S., Ng, V., Rees, E. (2022). Machine Learning Identification of Self-reported COVID-19 Symptoms from Tweets in Canada. In: Shaban-Nejad, A., Michalowski, M., Bianco, S. (eds) AI for Disease Surveillance and Pandemic Intelligence. W3PHAI 2021. Studies in Computational Intelligence, vol 1013. Springer, Cham. https://doi.org/10.1007/978-3-030-93080-6_9
