Leveraging Transfer Learning for Long Text Classification with Limited Data

  • Conference paper
Web Information Systems and Technologies (WEBIST 2022)

Abstract

Natural language processing (NLP) has emerged as a significant area of research within artificial intelligence and has received increased attention in recent years. This growth has prompted the Brazilian Ministry of Science, Technology, and Innovation to launch a project, through its Research Financing Products Portfolio, aimed at finding international funding opportunities for Brazilian researchers. Classification in this context is made harder by the scarcity of high-quality labeled data, which state-of-the-art NLP implementations require. In this study, we employ machine learning strategies to classify long, unstructured, and irregular texts obtained by scraping funding institutions’ websites. Given the limited availability of labeled training data, we proceed incrementally to identify the best-performing method. To alleviate data scarcity, we use pre-training to learn word context from larger data sets that closely resemble ours. We then combine transfer learning with deep learning models to improve sentence comprehension. We also conduct pre-processing experiments to address text irregularities. Comparison with the baseline model shows that the proposed approach yields promising results, with most trained models achieving over 90% accuracy. Our Longformer + CNN model achieved 94% accuracy with 100% precision, while the Word2Vec + CNN model achieved 93.55% accuracy. These findings illustrate a successful application of artificial intelligence in public administration.
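The idea of pairing frozen pre-trained word vectors with a convolutional classifier can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors, the filter weights, and the helper names embed, conv1d_max, and features are hypothetical and are not taken from the paper, which reports Word2Vec and Longformer representations whose details are not given in the abstract. The sketch only shows why max-over-time pooling yields a feature vector of the same length for a short text and a long one.

```python
# Toy illustration only: the vectors, filter weights, and helper names below
# are hypothetical and are not taken from the paper.

# Stand-in for a frozen, pre-trained embedding table (e.g. Word2Vec output).
PRETRAINED = {
    "research": [0.9, 0.1],
    "grant":    [0.8, 0.3],
    "funding":  [0.7, 0.2],
    "weather":  [0.1, 0.9],
}
UNKNOWN = [0.0, 0.0]  # out-of-vocabulary tokens map to a zero vector


def embed(tokens):
    """Look up each token in the frozen table; no weights are updated here."""
    return [PRETRAINED.get(t, UNKNOWN) for t in tokens]


def conv1d_max(vectors, kernel):
    """Slide kernel (width x dim) over the sequence, apply ReLU, max-pool over time."""
    width = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        score = sum(
            w * x
            for k_row, v_row in zip(kernel, window)
            for w, x in zip(k_row, v_row)
        )
        best = max(best, max(0.0, score))
    return best


def features(tokens, kernels):
    """One pooled value per kernel: a fixed-length vector for any text length."""
    vectors = embed(tokens)
    return [conv1d_max(vectors, k) for k in kernels]


KERNELS = [
    [[1.0, -1.0], [1.0, -1.0]],  # responds to two consecutive funding-like tokens
    [[-1.0, 1.0], [-1.0, 1.0]],  # responds to two consecutive weather-like tokens
]

short_text = "research grant".split()
long_text = ("weather " * 50 + "research grant funding").split()

# Both texts yield one value per kernel, regardless of their length.
print(len(features(short_text, KERNELS)), len(features(long_text, KERNELS)))  # prints: 2 2
```

In a real pipeline the pooled features would feed a trained classification layer, and the filters would be learned rather than hand-written; a contextual encoder such as Longformer would replace the static lookup table.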

Supported by Ministry of Science, Technology and Innovation (MCTI).

Notes

  1. https://huggingface.co/allenai/longformer-base-4096

  2. Project repository: https://github.com/chap0lin/PPF-MCTI

Acknowledgments

The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has provided partial support for this project. We sincerely thank Dr. Joao Gabriel Souza, who led the efforts in constructing the dataset and graciously shared the data for this study.

Author information

Corresponding authors

Correspondence to Carlos Alberto Alvares Rocha or Li Weigang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rocha, C.A.A. et al. (2023). Leveraging Transfer Learning for Long Text Classification with Limited Data. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2022. Lecture Notes in Business Information Processing, vol 494. Springer, Cham. https://doi.org/10.1007/978-3-031-43088-6_6

  • DOI: https://doi.org/10.1007/978-3-031-43088-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43087-9

  • Online ISBN: 978-3-031-43088-6

  • eBook Packages: Computer Science, Computer Science (R0)
