ABSTRACT
Parsing long documents, such as books, theses, and dissertations, is an important component of information extraction from scholarly documents. Layout analysis methods based on object detection have been developed in recent years to help with PDF document parsing. However, several challenges hinder the adoption of such methods for scholarly documents such as theses and dissertations. These include (a) the manual effort and resources required to annotate training datasets, (b) the scanned nature of many documents and the noise introduced by the capture process, and (c) the imbalanced distribution of element types in the documents. In this paper, we address some of the challenges related to object-detection-based layout analysis for long scholarly documents. First, we propose an AI-aided annotation method to help develop training datasets for object-detection-based layout analysis. This method leverages the knowledge of existing trained models to assist human annotators, thus reducing the time required for annotation. It also addresses the class imbalance problem by guiding annotators to focus on labeling instances of rare classes. We also introduce ETD-ODv2, a novel dataset for object detection on electronic theses and dissertations (ETDs). In addition to the page images included in ETD-OD [1], our dataset consists of more than 16K manually annotated page images originating from 100 scanned ETDs, along with annotations for 20K page images primarily consisting of rare classes that were labeled using the proposed framework. The new dataset thus covers a diversity of document types, viz., scanned and born-digital, and is better balanced in terms of training samples from different object categories.
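The idea of steering annotators toward rare classes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the count threshold, the confidence cutoff, and the class labels below are all hypothetical, and the detector output is assumed to be a simple list of (label, score) pairs per page.

```python
def rare_classes(label_counts, threshold):
    """Return labels whose annotated count so far is below the threshold."""
    return {label for label, n in label_counts.items() if n < threshold}


def select_pages_for_annotation(predictions, label_counts, threshold, min_score=0.5):
    """Pick pages where an existing detector predicts at least one rare class.

    predictions: dict mapping page_id -> list of (label, score) detections
                 (assumed format, produced by some already-trained model).
    """
    rare = rare_classes(label_counts, threshold)
    selected = []
    for page_id, detections in predictions.items():
        # Count confident detections that belong to a rare class.
        hits = [label for label, score in detections
                if label in rare and score >= min_score]
        if hits:
            selected.append((page_id, len(hits)))
    # Most rare-class hits first; ties broken by page_id for determinism.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return [page_id for page_id, _ in selected]


# Hypothetical annotation counts and detector output, for illustration only.
label_counts = {"paragraph": 5000, "figure": 900, "equation": 40, "algorithm": 12}
predictions = {
    "p1": [("paragraph", 0.98)],
    "p2": [("equation", 0.81), ("paragraph", 0.95)],
    "p3": [("algorithm", 0.66), ("equation", 0.72)],
    "p4": [("algorithm", 0.30)],
}
queue = select_pages_for_annotation(predictions, label_counts, threshold=100)
print(queue)
```

In this sketch the model's predictions would also serve as draft annotations that a human corrects rather than draws from scratch, which is where the time saving described above would come from.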
- Aman Ahuja, Alan Devera, and Edward Alan Fox. 2022. Parsing Electronic Theses and Dissertations Using Object Detection. In Proceedings of the first Workshop on Information Extraction from Scientific Publications. Association for Computational Linguistics, 121–130. https://aclanthology.org/2022.wiesp-1.14
- Frank Le Bourgeois, Zbigniew Bublinski, and Hubert Emptoz. 1992. A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents. In 11th IAPR International Conference on Pattern Recognition, ICPR 1992. Conference B: Pattern Recognition Methodology and Systems, The Hague, Netherlands, August 30-September 3, 1992. IEEE, 272–276. https://doi.org/10.1109/ICPR.1992.201771
- Kevin Dinh, Brian Dinh, Andrew Leavitt, and Annie Tran. 2022. Object Detection. http://hdl.handle.net/10919/114082 Virginia Tech CS4624 team term project.
- Ross B. Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 1440–1448. https://doi.org/10.1109/ICCV.2015.169
- Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022. ACM, 4083–4091. https://doi.org/10.1145/3503161.3548112
- Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1516–1520. https://doi.org/10.1109/ICDAR.2019.00244
- Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In 2nd International Workshop on Open Services and Tools for Document Analysis, OSTICDAR 2019, Sydney, Australia, September 22-25, 2019. IEEE, 1–6. https://doi.org/10.1109/ICDARW.2019.10029
- Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, and Jian Wu. 2021. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. In ACM/IEEE Joint Conference on Digital Libraries, JCDL 2021, Champaign, IL, USA, September 27-30, 2021. IEEE, 180–191. https://doi.org/10.1109/JCDL52503.2021.00030
- Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020. TableBank: Table Benchmark for Image-based Table Detection and Recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. European Language Resources Association, 1918–1925. https://aclanthology.org/2020.lrec-1.236/
- Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. 949–960. https://doi.org/10.18653/v1/2020.coling-main.82
- Patrice Lopez et al. 2008–2022. GROBID. https://github.com/kermitt2/grobid
- Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS 2019. https://openreview.net/forum?id=SJl3z659UH
- Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 91–99. https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
- Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 131–146. https://doi.org/10.1007/978-3-030-86549-8_9
- Tomasz Stanislawek, Filip Gralinski, Anna Wróblewska, Dawid Lipinski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemyslaw Biecek. 2021. Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts. In 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 12821). Springer, 564–579. https://doi.org/10.1007/978-3-030-86549-8_36
- Sami Uddin, Bipasha Banerjee, Jian Wu, William A. Ingram, and Edward A. Fox. 2021. Building A Large Collection of Multi-domain Electronic Theses and Dissertations. In 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, December 15-18, 2021. IEEE, 6043–6045. https://doi.org/10.1109/BigData52589.2021.9672058
- Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. CoRR abs/2207.02696 (2022). https://doi.org/10.48550/arXiv.2207.02696 arXiv:2207.02696
- Papers with Code. 2022. Real-Time Object Detection on COCO. https://paperswithcode.com/sota/real-time-object-detection-on-coco
- Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 1192–1200. https://doi.org/10.1145/3394486.3403172
- Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei A. F. Florêncio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 2579–2591. https://doi.org/10.18653/v1/2021.acl-long.201
- Xu Zhong, Jianbin Tang, and Antonio Jimeno-Yepes. 2019. PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019. IEEE, 1015–1022. https://doi.org/10.1109/ICDAR.2019.00166
- Yichao Zhou, James B. Wendt, Navneet Potti, Jing Xie, and Sandeep Tata. 2022. Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models. CoRR abs/2210.16391 (2022). https://doi.org/10.48550/arXiv.2210.16391 arXiv:2210.16391
- Kecheng Zhu, Zachary Gager, Shelby Neal, Jiangyue Li, and You Peng. 2022. Object Detection. http://hdl.handle.net/10919/109979 Virginia Tech CS4624 team term project.
Index Terms
- A New Annotation Method and Dataset for Layout Analysis of Long Documents