
YOLO-WS: A Novel Method for Webpage Segmentation

Authors:
Li Dai (ORCID 0009-0003-9444-4357), College of Software, Xinjiang University, China
Zunwang Ke (ORCID 0000-0002-2589-8377), College of Software, Xinjiang University, China
Wushour Silamu (ORCID 0000-0003-4592-7806), College of Information Science and Engineering, Xinjiang University, China

CNIOT '23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things, May 2023, Pages 451–456
https://doi.org/10.1145/3603781.3603862

Published: 27 July 2023

ABSTRACT

Traditional heuristic and machine-learning-based webpage segmentation algorithms are limited in feature extraction performance and efficiency. To address these limitations, we propose YOLO-WS, a webpage segmentation method based on deep-learning object detection and built on the YOLOv5 model. We adapt YOLOv5's network structure, loss function, and post-processing to the webpage segmentation task, and then train YOLO-WS on the improved model using transfer learning. Experimental results show that YOLO-WS achieves good performance on webpage segmentation tasks.
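
The abstract does not say how YOLO-WS changes the post-processing stage. As a point of reference only, the sketch below shows the conventional step that YOLOv5-style detectors apply to raw predictions, greedy non-maximum suppression (NMS): overlapping candidate boxes for the same page block are reduced to the highest-scoring one. The function names (`iou`, `greedy_nms`), the 0.5 threshold, and the sample boxes are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixels, with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_thresh: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections on a 1280x800 page screenshot: two overlapping
# candidates for a header block and one candidate for a sidebar.
boxes = [(0, 0, 1280, 100), (0, 5, 1280, 110), (0, 120, 300, 800)]
scores = [0.90, 0.75, 0.80]
kept = greedy_nms(boxes, scores)  # the lower-scoring header candidate is dropped
```

Webpage blocks are often nested or adjacent, so a fixed overlap threshold is a blunt instrument, which is one plausible reason a segmentation-specific method would revisit this stage.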


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
2. Information systems
  1. World Wide Web
    1. Web mining
      1. Data extraction and integration

Published in

CNIOT '23: Proceedings of the 2023 4th International Conference on Computing, Networks and Internet of Things
May 2023
1025 pages
ISBN:9798400700705
DOI:10.1145/3603781

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep learning
document layout analysis
object detection
webpage segmentation
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 39 of 82 submissions, 48%