Abstract
Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, a few labeled instances are the only resource available for training classifiers. Models trained on small datasets tend to overfit and produce inaccurate results; data augmentation (DA) techniques aim to mitigate this problem by generating synthetic instances that are fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of these DA methods in simulated low-data scenarios using well-known public datasets in English, with classifiers built by fine-tuning BERT models. We then describe how to adapt these DA methods to augment a small Portuguese dataset of tweets labeled with smart city dimensions (e.g., transportation, energy, water). Our experiments showed that some classes were noticeably improved by DA, with gains of up to 43% in F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods preserved the label but in some cases failed to preserve the semantics, and that generative models were able to produce high-quality synthetic instances.
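The back-translation idea summarized above can be sketched as a round trip through a pivot language: each instance is translated to the pivot and back, and only outputs that differ from the source are kept as synthetic instances. In the minimal sketch below, `translate` is a hypothetical stand-in for any machine-translation system (an API or a neural model); the toy dictionary translator exists only to exercise the pipeline and is not part of any real method.

```python
def back_translate(texts, translate, pivot="de"):
    """Back-translation data augmentation: source -> pivot -> source.

    `translate(text, src, tgt)` is a hypothetical hook for any MT system.
    Round-trip outputs identical to their source are discarded, since they
    add no variety to the training set.
    """
    synthetic = []
    for text in texts:
        pivoted = translate(text, src="en", tgt=pivot)
        restored = translate(pivoted, src=pivot, tgt="en")
        if restored != text:  # keep only genuine paraphrases
            synthetic.append(restored)
    return synthetic


# Toy "translator" (hypothetical, for illustration only).
_toy = {
    ("hello world", "en", "de"): "hallo welt",
    ("hallo welt", "de", "en"): "hello, world!",
    ("same text", "en", "de"): "same text",
    ("same text", "de", "en"): "same text",
}

def toy_translate(text, src, tgt):
    return _toy.get((text, src, tgt), text)


print(back_translate(["hello world", "same text"], toy_translate))
# -> ['hello, world!']
```

In practice the synthetic instances inherit the label of their source instance, which is why label preservation (discussed in the qualitative analysis) matters.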
Acknowledgements
This work was partially financed by CAPES Finance code 001.
Ethics declarations
Competing interests
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Detailed results
This section provides detailed results for the experiments described in Sect. 3. In the tables that follow, each line corresponds to the average over the five seeds and includes the number of synthetic instances created. This number can vary because duplicated instances generated by the augmentation are removed, as described in Sect. 3.3.
About this article
Cite this article
Bencke, L., Moreira, V.P. Data augmentation strategies to improve text classification: a use case in smart cities. Lang Resources & Evaluation 58, 659–694 (2024). https://doi.org/10.1007/s10579-023-09685-w