A New Annotation Method and Dataset for Layout Analysis of Long Documents

Authors:
Aman Ahuja (Department of Computer Science, Virginia Tech, USA; ORCID 0009-0002-8491-0193)
Kevin Dinh (Virginia Tech, USA; ORCID 0009-0000-0307-5532)
Brian Dinh (Virginia Tech, USA; ORCID 0009-0008-3552-2067)
William A. Ingram (Virginia Tech, USA; ORCID 0000-0002-8307-8844)
Edward Fox (Virginia Tech, USA; ORCID 0000-0003-1447-6870)

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023, April 2023, Pages 834–842. https://doi.org/10.1145/3543873.3587609

Published: 30 April 2023

ABSTRACT

Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the inherent noise introduced by the capture process, and (c) the imbalanced distribution of various types of elements in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for scholarly long documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem, guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
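The abstract does not spell out how the AI-aided annotation framework is implemented, so the following is only a minimal sketch of the general idea it describes: an existing detector proposes boxes, and pages where it confidently sees rare classes are sent to annotators first, with those proposals as a starting point. The names `Detection` and `select_pages_for_annotation`, the class labels, the score threshold, and the page budget are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Detection:
    """One box proposed by an already-trained detector (hypothetical schema)."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def select_pages_for_annotation(
    predictions: Dict[str, List[Detection]],
    rare_classes: Set[str],
    min_score: float = 0.5,  # assumed threshold, not from the paper
    budget: int = 2,         # assumed number of pages an annotator can review
) -> List[Tuple[str, List[Detection]]]:
    """Return up to `budget` pages, ranked by confident rare-class detections.

    Each returned page carries its confident detections so a human can
    correct them instead of drawing every box from scratch.
    """
    ranked = []
    for page_id, detections in predictions.items():
        confident = [d for d in detections if d.score >= min_score]
        rare_hits = sum(1 for d in confident if d.label in rare_classes)
        if rare_hits > 0:
            ranked.append((rare_hits, page_id, confident))
    # Most rare-class hits first; ties broken by page id for determinism.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [(page_id, confident) for _, page_id, confident in ranked[:budget]]


if __name__ == "__main__":
    # Toy model outputs; labels are illustrative placeholders.
    preds = {
        "page_001": [Detection("paragraph", 0.90, (0, 0, 10, 10))],
        "page_002": [Detection("algorithm", 0.80, (0, 0, 10, 10))],
        "page_003": [
            Detection("equation", 0.70, (0, 0, 5, 5)),
            Detection("algorithm", 0.60, (5, 5, 9, 9)),
        ],
    }
    for page_id, boxes in select_pages_for_annotation(preds, {"algorithm", "equation"}):
        print(page_id, [b.label for b in boxes])
```

Ranking by predicted rare-class content is one simple way to steer a limited annotation budget toward underrepresented categories; a real pipeline would also need a review interface and a way to handle classes the model misses entirely.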

References

[1] Aman Ahuja, Alan Devera, and Edward Alan Fox. 2022. Parsing Electronic Theses and Dissertations Using Object Detection. In Proceedings of the First Workshop on Information Extraction from Scientific Publications. Association for Computational Linguistics, 121–130. https://aclanthology.org/2022.wiesp-1.14
[2] Frank Le Bourgeois, Zbigniew Bublinski, and Hubert Emptoz. 1992. A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In 11th IAPR International Conference on Pattern Recognition, ICPR 1992. Conference B: Pattern Recognition Methodology and Systems, The Hague, Netherlands, August 30-September 3, 1992. IEEE, 272–276. https://doi.org/10.1109/ICPR.1992.201771
[3] Kevin Dinh, Brian Dinh, Andrew Leavitt, and Annie Tran. 2022. Object Detection. http://hdl.handle.net/10919/114082 Virginia Tech CS4624 team term project.
[4] Ross B. Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 1440–1448. https://doi.org/10.1109/ICCV.2015.169
[5] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022. ACM, 4083–4091. https://doi.org/10.1145/3503161.3548112
[6] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1516–1520. https://doi.org/10.1109/ICDAR.2019.00244
[7] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In 2nd International Workshop on Open Services and Tools for Document Analysis, OSTICDAR 2019, Sydney, Australia, September 22-25, 2019. IEEE, 1–6. https://doi.org/10.1109/ICDARW.2019.10029
[8] Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, and Jian Wu. 2021. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. In ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021, Champaign, IL, USA, September 27-30, 2021. IEEE, 180–191. https://doi.org/10.1109/JCDL52503.2021.00030
[9] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020. TableBank: Table Benchmark for Image-based Table Detection and Recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. European Language Resources Association, 1918–1925. https://aclanthology.org/2020.lrec-1.236/
[10] Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. 949–960. https://doi.org/10.18653/v1/2020.coling-main.82
[11] Patrice Lopez et al. 2008–2022. GROBID. https://github.com/kermitt2/grobid
[12] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS 2019. https://openreview.net/forum?id=SJl3z659UH
[13] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 91–99. https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
[14] Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 131–146. https://doi.org/10.1007/978-3-030-86549-8_9
[15] Tomasz Stanislawek, Filip Gralinski, Anna Wróblewska, Dawid Lipinski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemyslaw Biecek. 2021. Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 564–579. https://doi.org/10.1007/978-3-030-86549-8_36
[16] Sami Uddin, Bipasha Banerjee, Jian Wu, William A. Ingram, and Edward A. Fox. 2021. Building A Large Collection of Multi-domain Electronic Theses and Dissertations. In 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, December 15-18, 2021. IEEE, 6043–6045. https://doi.org/10.1109/BigData52589.2021.9672058
[17] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. CoRR abs/2207.02696 (2022). https://doi.org/10.48550/arXiv.2207.02696 arXiv:2207.02696
[18] Papers with Code. 2022. Real-Time Object Detection on COCO. https://paperswithcode.com/sota/real-time-object-detection-on-coco
[19] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 1192–1200. https://doi.org/10.1145/3394486.3403172
[20] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei A. F. Florêncio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 2579–2591. https://doi.org/10.18653/v1/2021.acl-long.201
[21] Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. 2019. PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1015–1022. https://doi.org/10.1109/ICDAR.2019.00166
[22] Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. 2022. Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models. CoRR abs/2210.16391 (2022). https://doi.org/10.48550/arXiv.2210.16391 arXiv:2210.16391
[23] Kecheng Zhu, Zachary Gager, Shelby Neal, Jiangyue Li, and You Peng. 2022. Object Detection. http://hdl.handle.net/10919/109979 Virginia Tech CS4624 team term project.

Published in
WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
April 2023
1567 pages
ISBN: 9781450394192
DOI: 10.1145/3543873
Editors: Ying Ding, Jie Tang, Juan Sequeda, Lora Aroyo, Carlos Castillo, Geert-Jan Houben
Copyright © 2023 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
AI-Aided
Document Understanding
Electronic Theses and Dissertations
Object Detection
Scholarly Documents
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%