DOI: 10.1145/3543873.3587609 (ACM Conference Proceedings, WWW)
research-article
Open Access

A New Annotation Method and Dataset for Layout Analysis of Long Documents

Published: 30 April 2023

ABSTRACT

Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the noise introduced by the capture process, and (c) the imbalanced distribution of element types in the documents. In this paper, we address some of the challenges of object-detection-based layout analysis for long scholarly documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. The method leverages the knowledge of existing trained models to help human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem by guiding annotators to focus on labeling instances of rare classes. Second, we introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
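The abstract does not spell out how annotators are steered toward rare classes, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes an existing detector has already produced (label, confidence) predictions for each unlabeled page, and it ranks pages so that those likely to contain under-represented classes are annotated first. The class names, counts, page identifiers, confidence threshold, and the inverse-frequency weighting are all hypothetical choices made for illustration.

```python
from collections import Counter


def rare_class_weights(label_counts):
    """Inverse-frequency weight per class: rarer classes get larger weights."""
    total = sum(label_counts.values())
    return {label: total / count for label, count in label_counts.items() if count > 0}


def page_priority(predictions, weights, min_confidence=0.5):
    """Sum the weights of the distinct classes predicted with enough confidence."""
    labels = {label for label, conf in predictions if conf >= min_confidence}
    return sum(weights.get(label, 0.0) for label in labels)


def rank_pages(page_predictions, label_counts, min_confidence=0.5):
    """Return page ids ordered from highest to lowest annotation priority."""
    weights = rare_class_weights(label_counts)
    return sorted(
        page_predictions,
        key=lambda page_id: page_priority(
            page_predictions[page_id], weights, min_confidence
        ),
        reverse=True,
    )


# Hypothetical counts of already-annotated objects per class.
label_counts = Counter(
    {"paragraph": 9000, "figure": 800, "equation": 150, "algorithm": 50}
)

# Hypothetical detector output: page id -> list of (label, confidence).
page_predictions = {
    "page_001": [("paragraph", 0.97), ("paragraph", 0.91)],
    "page_002": [("paragraph", 0.95), ("algorithm", 0.72)],
    "page_003": [("figure", 0.88), ("equation", 0.41)],
}

queue = rank_pages(page_predictions, label_counts)
print(queue)
```

In this toy input, the page with a confident prediction of the rarest class moves to the front of the queue, while the low-confidence equation prediction on the third page is ignored. In a real workflow the predicted boxes would presumably also be shown to the annotator as editable suggestions rather than used only for ranking.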

References

  1. Aman Ahuja, Alan Devera, and Edward Alan Fox. 2022. Parsing Electronic Theses and Dissertations Using Object Detection. In Proceedings of the first Workshop on Information Extraction from Scientific Publications. Association for Computational Linguistics, 121–130. https://aclanthology.org/2022.wiesp-1.14
  2. Frank Le Bourgeois, Zbigniew Bublinski, and Hubert Emptoz. 1992. A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In 11th IAPR International Conference on Pattern Recognition, ICPR 1992. Conference B: Pattern Recognition Methodology and Systems, The Hague, Netherlands, August 30-September 3, 1992. IEEE, 272–276. https://doi.org/10.1109/ICPR.1992.201771
  3. Kevin Dinh, Brian Dinh, Andrew Leavitt, and Annie Tran. 2022. Object Detection. http://hdl.handle.net/10919/114082 Virginia Tech CS4624 team term project.
  4. Ross B. Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 1440–1448. https://doi.org/10.1109/ICCV.2015.169
  5. Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022. ACM, 4083–4091. https://doi.org/10.1145/3503161.3548112
  6. Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1516–1520. https://doi.org/10.1109/ICDAR.2019.00244
  7. Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In 2nd International Workshop on Open Services and Tools for Document Analysis, OSTICDAR 2019, Sydney, Australia, September 22-25, 2019. IEEE, 1–6. https://doi.org/10.1109/ICDARW.2019.10029
  8. Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, and Jian Wu. 2021. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. In ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021, Champaign, IL, USA, September 27-30, 2021. IEEE, 180–191. https://doi.org/10.1109/JCDL52503.2021.00030
  9. Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020. TableBank: Table Benchmark for Image-based Table Detection and Recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. European Language Resources Association, 1918–1925. https://aclanthology.org/2020.lrec-1.236/
  10. Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. 949–960. https://doi.org/10.18653/v1/2020.coling-main.82
  11. Patrice Lopez et al. 2008–2022. GROBID. https://github.com/kermitt2/grobid
  12. Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS 2019. https://openreview.net/forum?id=SJl3z659UH
  13. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 91–99. https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
  14. Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 131–146. https://doi.org/10.1007/978-3-030-86549-8_9
  15. Tomasz Stanislawek, Filip Gralinski, Anna Wróblewska, Dawid Lipinski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemyslaw Biecek. 2021. Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 564–579. https://doi.org/10.1007/978-3-030-86549-8_36
  16. Sami Uddin, Bipasha Banerjee, Jian Wu, William A. Ingram, and Edward A. Fox. 2021. Building A Large Collection of Multi-domain Electronic Theses and Dissertations. In 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, December 15-18, 2021. IEEE, 6043–6045. https://doi.org/10.1109/BigData52589.2021.9672058
  17. Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. CoRR abs/2207.02696 (2022). https://doi.org/10.48550/arXiv.2207.02696 arXiv:2207.02696
  18. Papers with Code. 2022. Real-Time Object Detection on COCO. https://paperswithcode.com/sota/real-time-object-detection-on-coco
  19. Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 1192–1200. https://doi.org/10.1145/3394486.3403172
  20. Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei A. F. Florêncio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 2579–2591. https://doi.org/10.18653/v1/2021.acl-long.201
  21. Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. 2019. PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1015–1022. https://doi.org/10.1109/ICDAR.2019.00166
  22. Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. 2022. Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models. CoRR abs/2210.16391 (2022). https://doi.org/10.48550/arXiv.2210.16391 arXiv:2210.16391
  23. Kecheng Zhu, Zachary Gager, Shelby Neal, Jiangyue Li, and You Peng. 2022. Object Detection. http://hdl.handle.net/10919/109979 Virginia Tech CS4624 team term project.
