Abstract
Website classification is a crucial task in applications such as web search, content filtering, and recommendation systems. Accurately categorizing long web pages by their content is essential for providing relevant and personalized user experiences. Transformer-based models such as BERT and RoBERTa have significantly advanced natural language processing, but they handle long sequences poorly: their input length is fixed, a restriction that stems from the quadratic complexity of self-attention. This paper presents a simple weighted stratified split approach (WSSA) to address this limitation of BERT and RoBERTa when processing long text sequences for website classification. WSSA first splits web pages into smaller chunks. A new training chunk dataset is then generated by a weighted stratified split that follows the distribution of the categories in the whole chunk dataset, and this training chunk dataset is used to train the models. Our approach improves the accuracy of the BERT and RoBERTa models, which then surpass the Longformer and BigBird models. The proposed solution enables efficient processing and data augmentation, with reasonable fine-tuning times for BERT and RoBERTa. Inference times remain low, which makes these models practical for real-time website classification. Combining WSSA with the index web page performs exceptionally well, highlighting the effectiveness of the approach in addressing the long text sequence limitation and improving transformer-based models for website classification.
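The two steps the abstract describes, chunking pages and then drawing a stratified training set of chunks, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the word-based chunking, the chunk size, and the per-category quota rule (a fixed fraction of each category's chunks, so that the training set follows the category distribution of the whole chunk dataset) are all assumptions made for illustration. The abstract does not specify how the weights are computed.

```python
import random
from collections import defaultdict


def chunk_words(text, chunk_size=200):
    """Split a page's text into consecutive chunks of at most chunk_size words.

    Word-based chunking is an assumption; a real pipeline would more likely
    count model tokens.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def build_chunk_dataset(pages, chunk_size=200):
    """Turn (text, category) pages into (chunk, category) pairs.

    Each chunk inherits the category of the page it came from.
    """
    return [(chunk, category)
            for text, category in pages
            for chunk in chunk_words(text, chunk_size)]


def stratified_train_split(chunk_dataset, train_fraction=0.8, seed=0):
    """Sample a training subset that follows the category distribution.

    Hypothetical quota rule: each category contributes the same fraction
    of its chunks, with at least one chunk per category.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for chunk, category in chunk_dataset:
        by_category[category].append(chunk)

    train = []
    for category in sorted(by_category):
        chunks = by_category[category]
        quota = max(1, round(train_fraction * len(chunks)))
        train.extend((c, category) for c in rng.sample(chunks, quota))
    rng.shuffle(train)
    return train


# Toy data: two "pages" of different lengths and categories.
pages = [
    ("news " * 1000, "news"),
    ("shop " * 500, "shopping"),
]
chunk_dataset = build_chunk_dataset(pages, chunk_size=100)
train_set = stratified_train_split(chunk_dataset, train_fraction=0.6)
```

In this toy setting the longer page yields twice as many chunks as the shorter one, and the sampled training set keeps that 2:1 ratio between the two categories. The resulting (chunk, category) pairs would then be fed to a standard fine-tuning loop for BERT or RoBERTa.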
Notes
- 1. https://www.olfeo.com/ (visited on: 07/07/2022).
Acknowledgement
This work is part of the RAPID project METIS which was funded by the French Ministry of the Armed Forces, Defence Innovation Agency (Reference number: 202906117).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Koufi, M.Z., Guessoum, Z., Keziou, A., Yahiaoui, I., Martineau, C., Domin, W. (2024). Text Chunking to Improve Website Classification. In: Pereira, A.I., Mendes, A., Fernandes, F.P., Pacheco, M.F., Coelho, J.P., Lima, J. (eds) Optimization, Learning Algorithms and Applications. OL2A 2023. Communications in Computer and Information Science, vol 1981. Springer, Cham. https://doi.org/10.1007/978-3-031-53025-8_15
DOI: https://doi.org/10.1007/978-3-031-53025-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53024-1
Online ISBN: 978-3-031-53025-8
eBook Packages: Computer Science, Computer Science (R0)