
Data augmentation strategies to improve text classification: a use case in smart cities

  • Original Paper

Language Resources and Evaluation

Abstract

Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, a few labeled instances are the only resource available for training classifiers. Models trained on small datasets tend to overfit and produce inaccurate results. Data augmentation (DA) techniques are an alternative to minimize this problem: DA generates synthetic instances that can be fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of the DA methods on simulated low-data scenarios using well-known public datasets in English, with classifiers built by fine-tuning BERT models. We describe how to adapt these DA methods to augment a small Portuguese dataset containing tweets labeled with smart city dimensions (e.g., transportation, energy, water). Our experiments showed that some classes were noticeably improved by DA, with an improvement of 43% in terms of F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods were able to preserve the label but in some cases failed to preserve the semantics, and that generative models were able to produce high-quality synthetic instances.
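Among the simplest DA families studied for text classification are word-level edit operations of the kind popularized by EDA. As an illustration only, the sketch below implements two such operations (random swap and random deletion) in pure Python; the function names and parameters are our own and do not reflect the API of any toolkit cited in the notes.

```python
import random


def random_swap(words, n=1):
    """Swap two randomly chosen word positions n times (needs >= 2 words)."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p=0.1):
    """Drop each word with probability p; always keep at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]


def augment(sentence, n_aug=4, seed=0):
    """Generate n_aug synthetic variants of a labeled sentence.

    The label of the original sentence is assumed to carry over to the
    variants, which is the standard (label-preserving) DA assumption.
    """
    random.seed(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_aug):
        op = random.choice([random_swap, random_deletion])
        variants.append(" ".join(op(words)))
    return variants
```

Each variant is paired with the original instance's label and added to the training set, which is how DA methods of this kind are typically applied.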


Notes

  1. https://github.com/lbencke/DA

  2. https://platform.openai.com/docs/models/gpt-3-5

  3. https://github.com/YJiangcm/SST-2-sentiment-analysis/tree/master/data

  4. https://github.com/nidhaloff/deep-translator

  5. https://github.com/jasonwei20/eda_nlp

  6. https://github.com/makcedward/nlpaug

  7. https://huggingface.co/bert-base-uncased

  8. https://huggingface.co/tuner007/pegasus_paraphrase

  9. https://www.sbert.net/docs/pretrained_models.html#model-overview

  10. https://www.sbert.net/

  11. https://wn.readthedocs.io/en/latest/index.html

  12. https://platform.openai.com/docs/guides/rate-limits/overview

  13. GPT-3 dataset statistics: https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv


Acknowledgements

This work was partially financed by CAPES, Finance Code 001.

Author information

Corresponding author

Correspondence to Luciana Bencke.

Ethics declarations

Competing interests

The authors declare the following financial interests/personal relationships, which may be considered potential competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Detailed results

This section provides detailed results for the experiments described in Sect. 3. In the following tables, each row reports the average over the five seeds, including the number of synthetic instances created. This number can vary because duplicated instances generated by the augmentation are removed, as described in Sect. 3.3.
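The bookkeeping described above (removing duplicated synthetic instances, then averaging results over seeds) can be sketched as follows. This is an illustrative sketch with helper names of our own; it is not the authors' released code.

```python
from statistics import mean


def deduplicate(synthetic, originals):
    """Drop synthetic instances that duplicate an original or each other.

    This mirrors the post-processing step in which duplicated augmented
    texts are removed before training, so the effective number of
    synthetic instances can differ between seeds.
    """
    seen = set(originals)
    unique = []
    for text in synthetic:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def average_over_seeds(runs):
    """Average Macro-F1 and synthetic-instance counts across seed runs.

    runs: list of (macro_f1, n_synthetic) tuples, one per seed.
    Returns the per-seed means, i.e. the quantities reported per table row.
    """
    f1s = [macro_f1 for macro_f1, _ in runs]
    counts = [n_synthetic for _, n_synthetic in runs]
    return mean(f1s), mean(counts)
```

For example, two seed runs with Macro-F1 of 0.80 and 0.90 and 100 and 90 surviving synthetic instances would be reported as 0.85 and 95.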

See Tables 10, 11 and 12.

Table 10 AG-news low-data results: Number of instances per class and Macro-F1 results for the DA methods considering the Original instances (O), Synthetically Created Instances (S), or a combination of both (S+O)
Table 11 SST2 low-data results: Number of instances per class and Macro-F1 results for the DA methods considering the Original instances (O), Synthetically Created Instances (S), or a combination of both (S+O)
Table 12 TweetEval low-data results: Number of instances per class and Macro-F1 results for the DA methods considering the Original instances (O), Synthetically Created Instances (S), or a combination of both (S+O)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Bencke, L., Moreira, V.P. Data augmentation strategies to improve text classification: a use case in smart cities. Lang Resources & Evaluation 58, 659–694 (2024). https://doi.org/10.1007/s10579-023-09685-w

