Leveraging Transfer Learning for Long Text Classification with Limited Data

Rocha, Carlos Alberto Alvares; Weigang, Li; Dib, Marcos Vinícius Pinheiro; Faria, Allan Victor Almeida; Cajueiro, Daniel Oliveira; de Melo, Maísa Kely; Celestino, Victor Rafael Rezende

doi:10.1007/978-3-031-43088-6_6

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 494)

Included in the following conference series:

International Conference on Web Information Systems and Technologies

Abstract

Natural language processing (NLP) has emerged as a significant area of research within artificial intelligence and has received increased attention in recent years. This has prompted the Brazilian Ministry of Science, Technology, and Innovation to launch a project aimed at finding international funding opportunities for Brazilian researchers through its Research Financing Products Portfolio. However, classification in this context is made harder by the scarcity of high-quality labeled data, which state-of-the-art NLP implementations require. In this study, we employ machine learning strategies to classify long, unstructured, and irregular texts obtained by scraping funding institutions' websites. Given the limited availability of labeled training data, we adopt an incremental approach to identify the best-performing method. To alleviate data scarcity, we use pre-training to learn word context from other, larger data sets with significant similarities to ours. We then combine transfer learning with deep learning models to enhance sentence comprehension. We also conduct pre-processing experiments to address text irregularities. Comparative analysis with the baseline model shows that our proposed approach yields promising results, with most trained models achieving over 90% accuracy. Our Longformer + CNN model achieved 94% accuracy with 100% precision, while the Word2Vec + CNN model achieved 93.55% accuracy. These findings highlight a successful application of artificial intelligence in public administration.
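
The pairing of 94% accuracy with 100% precision reported above can look surprising at first. The sketch below, in plain Python, shows how the two metrics relate: precision is lowered only by false positives, so a classifier that never raises a false alarm but misses some positive documents reaches 100% precision while its accuracy stays below 100%. The confusion counts are hypothetical, chosen only to reproduce the same two percentages; they are not the paper's test data, and the names `ConfusionCounts`, `accuracy`, `precision`, and `recall` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical values, not from the paper)."""
    tp: int  # positive documents correctly labeled positive
    fp: int  # negative documents wrongly labeled positive
    fn: int  # positive documents wrongly labeled negative
    tn: int  # negative documents correctly labeled negative


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all documents that were labeled correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Fraction of predicted positives that are truly positive."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def recall(c: ConfusionCounts) -> float:
    """Fraction of true positives that the classifier found."""
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


# Assumed example: 100 test documents, no false positives, 6 missed positives.
counts = ConfusionCounts(tp=44, fp=0, fn=6, tn=50)

print(f"accuracy:  {accuracy(counts):.2%}")
print(f"precision: {precision(counts):.2%}")
print(f"recall:    {recall(counts):.2%}")
```

In this assumed scenario every error is a missed positive, so recall is the metric that absorbs the loss; reading precision alongside recall gives a fuller picture than accuracy alone when labeled data is scarce.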

Supported by Ministry of Science, Technology and Innovation (MCTI).

Notes

1. https://huggingface.co/allenai/longformer-base-4096
2. Project repository: https://github.com/chap0lin/PPF-MCTI

Acknowledgments

The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has provided partial support for this project. We sincerely thank Dr. Joao Gabriel Souza, who led the efforts in constructing the dataset and graciously shared the data for this study.

Author information

Authors and Affiliations

LAMFO, University of Brasilia, Brasilia, Brazil
Carlos Alberto Alvares Rocha, Li Weigang, Marcos Vinícius Pinheiro Dib, Allan Victor Almeida Faria, Daniel Oliveira Cajueiro, Maísa Kely de Melo & Victor Rafael Rezende Celestino
TransLab, University of Brasilia, Brasilia, Brazil
Li Weigang & Marcos Vinícius Pinheiro Dib
Federal Institute of Minas Gerais, Campus Formiga, Formiga, Brazil
Maísa Kely de Melo

Corresponding authors

Correspondence to Carlos Alberto Alvares Rocha or Li Weigang.

Editor information

Editors and Affiliations

University of Padua (UNIPD), Padua, Italy
Massimo Marchiori
University of Seville, Seville, Spain
Francisco José Domínguez Mayo
Polytechnic Institute of Setúbal/INSTICC, Setubal, Portugal
Joaquim Filipe

Cite this paper

Rocha, C.A.A. et al. (2023). Leveraging Transfer Learning for Long Text Classification with Limited Data. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2022. Lecture Notes in Business Information Processing, vol 494. Springer, Cham. https://doi.org/10.1007/978-3-031-43088-6_6

DOI: https://doi.org/10.1007/978-3-031-43088-6_6
Published: 29 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43087-9
Online ISBN: 978-3-031-43088-6
eBook Packages: Computer Science, Computer Science (R0)
