Text Chunking to Improve Website Classification

Koufi, Mohamed Zohir; Guessoum, Zahia; Keziou, Amor; Yahiaoui, Itheri; Martineau, Chloé; Domin, Wandrille

doi:10.1007/978-3-031-53025-8_15

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1981)

Included in the following conference series:

International Conference on Optimization, Learning Algorithms and Applications

Abstract

Website classification is a crucial task in applications such as web search, content filtering, and recommendation systems. Assigning long web pages to categories based on their content is essential for providing accurate and personalized user experiences. Transformer-based models such as BERT and RoBERTa have significantly advanced the field of natural language processing. However, these models struggle with long sequences: their input length is fixed, a restriction that stems from the quadratic complexity of self-attention with respect to sequence length. This paper presents a simple weighted stratified split approach (WSSA) that addresses this limitation of BERT and RoBERTa for website classification. WSSA first splits web pages into smaller chunks, then generates a new training chunk dataset through a weighted stratified split that follows the category distribution of the whole chunk dataset. This training chunk dataset is then used to train the models. Our approach improves the accuracy of BERT and RoBERTa, which then surpass Longformer and BigBird. The proposed solution enables efficient processing and data augmentation, with reasonable fine-tuning times for BERT and RoBERTa. Inference times remain low, which makes these models practical for real-time website classification. Combining WSSA with the index web page performs especially well, highlighting its effectiveness in addressing the long-sequence limitation and improving transformer-based models for website classification.
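The abstract describes the chunk-then-split idea only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that chunks are fixed-size windows of whitespace-separated words, that each chunk inherits the label of its page, and that the "weighted stratified split" can be approximated by proportional allocation per category. The function names, `chunk_size`, and `train_fraction` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def chunk_text(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def build_chunk_dataset(pages, chunk_size=200):
    """pages is a list of (text, label); every chunk inherits its page label."""
    return [(chunk, label)
            for text, label in pages
            for chunk in chunk_text(text, chunk_size)]


def stratified_split(chunks, train_fraction=0.8, seed=0):
    """Sample the same fraction of chunks from every category, so the
    training set keeps the category distribution of the whole chunk
    dataset (assumption: proportional allocation stands in for the
    paper's weighting scheme)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in chunks:
        by_label[item[1]].append(item)

    train, held_out = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(train_fraction * len(items))
        train.extend(items[:n_train])
        held_out.extend(items[n_train:])
    return train, held_out


# Toy usage: two synthetic "pages" of different lengths.
pages = [("a " * 1000, "news"), ("b " * 500, "shop")]
chunks = build_chunk_dataset(pages, chunk_size=100)
train, held_out = stratified_split(chunks, train_fraction=0.8)
train_counts = Counter(label for _, label in train)
```

In this toy run the longer page yields twice as many chunks as the shorter one, and the training split preserves that 2:1 ratio. A real pipeline would more likely chunk on tokenizer tokens so that each chunk fits the model's input limit.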

Notes

1. https://www.olfeo.com/ (visited on: 07/07/2022).
2. https://romeo.univ-reims.fr/.

Acknowledgement

This work is part of the RAPID project METIS which was funded by the French Ministry of the Armed Forces, Defence Innovation Agency (Reference number: 202906117).

Author information

Authors and Affiliations

CReSTIC, EA 3804, University of Reims Champagne-Ardenne, Reims, France
Mohamed Zohir Koufi, Zahia Guessoum & Itheri Yahiaoui
LMR - UMR9008, University of Reims Champagne-Ardenne, Reims, France
Amor Keziou
Olfeo, Paris, France
Mohamed Zohir Koufi, Chloé Martineau & Wandrille Domin

Corresponding author

Correspondence to Mohamed Zohir Koufi.

Editor information

Editors and Affiliations

Instituto Politécnico de Bragança, Bragança, Portugal
Ana I. Pereira
University of Azores, Ponta Delgada, Portugal
Armando Mendes
Instituto Politécnico de Bragança, Bragança, Portugal
Florbela P. Fernandes
Instituto Politécnico de Bragança, Bragança, Portugal
Maria F. Pacheco
Instituto Politécnico de Bragança, Bragança, Portugal
João P. Coelho
Instituto Politécnico de Bragança, Bragança, Portugal
José Lima

Cite this paper

Koufi, M.Z., Guessoum, Z., Keziou, A., Yahiaoui, I., Martineau, C., Domin, W. (2024). Text Chunking to Improve Website Classification. In: Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J. (eds) Optimization, Learning Algorithms and Applications. OL2A 2023. Communications in Computer and Information Science, vol 1981. Springer, Cham. https://doi.org/10.1007/978-3-031-53025-8_15

DOI: https://doi.org/10.1007/978-3-031-53025-8_15
Published: 01 February 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53024-1
Online ISBN: 978-3-031-53025-8
eBook Packages: Computer Science, Computer Science (R0)
