ABSTRACT
To address the limitations of traditional heuristic and machine learning-based webpage segmentation algorithms in feature extraction performance and efficiency, we propose YOLO-WS, a webpage segmentation method based on deep learning object detection and built on the YOLOv5 model. We adapt YOLOv5’s network structure, loss function, and post-processing to the webpage segmentation task, and then train the improved model with transfer learning to obtain YOLO-WS. Experimental results show that YOLO-WS achieves good performance on webpage segmentation tasks.
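To make the object-detection framing concrete, the sketch below treats each webpage segment as a scored bounding box on a page screenshot and applies plain greedy non-maximum suppression as a generic post-processing step. This is an illustration only: the abstract does not describe YOLO-WS's actual post-processing, and the box layout, the `iou` and `nms` helper names, and the 0.5 threshold are all assumptions.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, score) in screenshot pixels.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Generic stand-in for detector post-processing, not the paper's method.
    """
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


# Two near-duplicate detections of one segment plus a separate segment below.
candidates = [
    (0, 0, 100, 50, 0.9),
    (5, 0, 100, 50, 0.6),
    (0, 60, 100, 200, 0.8),
]
segments = nms(candidates)
```

Here the second candidate overlaps the first almost entirely and is suppressed, while the non-overlapping third box survives as its own segment.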
Index Terms
- YOLO-WS: A Novel Method for Webpage Segmentation