Prediction of Author’s Profile Based on Fine-Tuning a BERT Model

Bassem Bsir, Nabil Khoufi, Mounir Zrigui

Abstract


The task of author profiling consists in inferring demographic features of social network users by studying their published content or the interactions between them. In the literature, many studies have sought to improve the accuracy of the techniques used in this process. Existing methods can be divided into two types: simple linear models and complex deep neural network models. Among the latter, transformer-based models have exhibited the highest efficiency in NLP analysis across several languages (English, German, French, Turkish, Arabic, etc.). Despite their good performance, these approaches do not fully cover author profiling analysis and should therefore be further enhanced. In this paper, we propose a new deep learning strategy that trains a customized transformer model to learn the optimal features of our dataset. To this end, we fine-tune the model using a transfer learning approach, which improves on the results obtained with random initialization. By adapting the model and applying the retraining process on the PAN 2018 authorship dataset, we achieve an accuracy of about 79%.
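
To give a concrete picture of the fine-tuning step described above, the following minimal Python sketch adapts a pre-trained BERT encoder to a binary author-profiling label (e.g., gender) with the Hugging Face Trainer API. The model name, hyperparameters, and toy data are illustrative assumptions, not the authors' exact configuration or pipeline.

# Illustrative sketch: fine-tuning a pre-trained BERT encoder for a binary
# author-profiling label (e.g., gender) via transfer learning.
# Model name, hyperparameters, and the toy data are assumptions, not the
# authors' exact setup.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Toy stand-in for PAN 2018 author-profiling data: one concatenated
# tweet feed per author and a binary label.
data = Dataset.from_dict({
    "text": ["first author's tweets ...", "second author's tweets ..."],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # classification head starts randomly initialized

def tokenize(batch):
    # Truncate/pad each author's text to a fixed length for batching.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="ap-bert",
    num_train_epochs=3,              # assumed value
    per_device_train_batch_size=8,   # assumed value
    learning_rate=2e-5,              # typical BERT fine-tuning rate
)

trainer = Trainer(model=model, args=args, train_dataset=encoded)
trainer.train()  # updates both the encoder and the new classification head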


DOI: https://doi.org/10.31449/inf.v48i1.4839

This work is licensed under a Creative Commons Attribution 3.0 License.