Abstract
Surveillance of open-source media, such as social media, has become an essential complement to traditional surveillance data for quickly detecting changes in disease occurrence in time and space. We present a method for classifying tweets to identify narratives about COVID-19 symptoms, in order to produce a dataset for downstream surveillance applications. A dataset of 10,405 tweets was manually classified as relevant or not relevant to self-reported symptoms of COVID-19. Five machine learning classification algorithms, each combined with different tokenization methods, were trained and tested on this dataset. The support vector machine (SVM) algorithm, using term frequency-inverse document frequency (TF-IDF) weighting of character n-grams of length 3 to 4 as the tokenization method, achieved the highest F1-score, 0.70. However, the training dataset was imbalanced between the two classes. To reduce the bias caused by the imbalanced classes, the crowdsourcing website Mechanical Turk was used to add 133 relevant tweets. This addition improved the F1-score from 0.70 to 0.77.
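To make the feature extraction and the evaluation metric named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the lowercasing step, the smoothed IDF formula, the L2 normalisation, and the two example tweets are all assumptions made for illustration. In practice, the resulting vectors would be passed to a classifier such as an SVM.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=4):
    """Return all overlapping character n-grams of length n_min..n_max."""
    text = text.lower()  # assumption: case-folding before tokenization
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs):
    """Map each document to a {ngram: weight} dict with unit L2 norm."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example tweets, for illustration only.
tweets = ["I have a fever and a dry cough", "New COVID-19 case counts released"]
vectors = tfidf_vectors(tweets)
print(len(vectors[0]), "distinct character n-grams in the first tweet")
```

Because F1 is computed on the positive (relevant) class only, it is sensitive to class imbalance, which is consistent with the abstract's report that adding relevant tweets changed the score.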
Notes
- 1. However, given that English and French are the two official languages most widely used by various communities on social media in Canada, our later analysis used only tweets in those two languages.
- 2.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Gilbert, JP., Niu, J., de Montigny, S., Ng, V., Rees, E. (2022). Machine Learning Identification of Self-reported COVID-19 Symptoms from Tweets in Canada. In: Shaban-Nejad, A., Michalowski, M., Bianco, S. (eds) AI for Disease Surveillance and Pandemic Intelligence. W3PHAI 2021. Studies in Computational Intelligence, vol 1013. Springer, Cham. https://doi.org/10.1007/978-3-030-93080-6_9
DOI: https://doi.org/10.1007/978-3-030-93080-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93079-0
Online ISBN: 978-3-030-93080-6
eBook Packages: Intelligent Technologies and Robotics (R0)