Algorithms for extracting lines and paragraphs with their properties from PDF documents

The article discusses algorithms for detecting and extracting lines and paragraphs with their properties and attributes in PDF documents, and analyses the structure of a PDF file and its objects. By means of special operators in objects, the content of a PDF document is stored as symbols or symbol groups, whose position on the page is also preserved. The main challenge we face while extracting paragraphs from a PDF document is the complexity of the format, which can retain various types of information and can be created in several ways.


Introduction
Portable Document Format (PDF) was developed by Adobe as a final-form format for text documents: a document is expected to look the same regardless of the software used to view it [1]. That is why it is the preferred choice for transferring, distributing or publishing documents. Originally, the editing tools were quite limited; nowadays there are advanced tools such as tagging [2], but they are not as widespread as the original ones.
Most scientific papers such as research project reports, graduation theses, etc. are kept and distributed as PDF text documents. For example, from 2013 to 2023, 17,300 graduation theses were written at ITMO University. Moreover, most scientific papers published in e-journals are also produced in PDF. The latest MIME type detection study (July 2021) based on the CommonCrawl database confirms that PDF ranks third among the most widespread formats on the Internet (only HTML and XHTML rank higher); it has left behind such formats as JPEG, PNG and GIF [3]. PDF has become so popular owing to several features:
─ No matter what software is used for viewing, the document structure and formatting remain intact.
─ A document can easily be created and viewed.
─ An excellent set of compression tools.
─ The format is secure.
Text extraction and segmentation tools, approaches to image processing and table extraction have been thoroughly analyzed in the following works: "Caradoc: a pragmatic approach to PDF parsing and validation" [4] and "Extraction and visualization of citation relationships and its attributes for papers in PDF" [5].

Related work
Many PDF parsing programs enable users to extract only text and images. Some of them can help to extract partial text properties. Nevertheless, none of these libraries is suitable for checking the accuracy of structural elements in a document. However, some solutions can be employed to obtain PDF document properties and objects: PyPDF2 [6], pdfminer [7] and slate [8] in Python perform syntactic analysis and can help to extract text, images and metadata from PDF files.
PDFBox [9] and iText [10] are Java libraries that can also be used to extract text and images from PDF files.
Poppler, written in C++ [11], allows extracting text, images and metadata from PDF files; it is also well optimized and efficient.
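As an illustration, the minimal sketch below (assuming the pdfminer.six package; the file name report.pdf is hypothetical) extracts individual characters together with their page coordinates, font name and point size, i.e. the raw data that the algorithms described later operate on.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def extract_chars(path):
    """Yield (page_no, char, x0, y0, font, size) for every character."""
    for page_no, page_layout in enumerate(extract_pages(path), start=1):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield (page_no, obj.get_text(),
                               obj.x0, obj.y0, obj.fontname, obj.size)

# Example usage (hypothetical file name):
# for record in extract_chars("report.pdf"):
#     print(record)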
PDF data extraction has been studied by many experts, and nowadays there are several approaches to the issue. Graph-based syntactic analysis. It has been considered in paper [12], where the author proposes a graph-based analysis of text items, grouping page elements according to edge weights. This method shows good results in extracting text, fonts and headers and in identifying the interline interval for paragraphs [13].
Machine learning analysis. Article [14] discusses convolutional neural networks and machine learning methods, considering the possibilities of table detection in PDF documents. Some authors also propose machine-learning-based methods for image [15,16] and content [17] extraction. However, when we deal with PDF document structure and properties, we should treat these methods as complementary ones, because the risk of error is high in such cases.
Mathematical methods. These methods are based on the extraction and analysis of blocks. Article [18], in turn, analyses a Markov model and the ways it can be employed for bibliographic data, using the PDFBox library (for extracting text and font size data, in particular).
Moreover, there are several programs that can extract formulae from a PDF document. The algorithms and methods employed in such programs can be used to improve parsing algorithms for PDF documents [19,20].
Paper [21] is of particular interest: the authors extract text blocks and classify them. Their results can be quite useful at the next stage, when the accuracy of PDF documents has to be checked.
The analysis of the scientific papers in question leads to the following conclusions: value-based deterministic algorithms are more efficient for extracting structural elements and their properties; various machine-learning methods are appropriate for image and formula extraction; and machine-learning methods can be used to check whether the algorithms function properly. To create proper algorithms for detecting and extracting lines and paragraphs with their properties and attributes in PDF documents, it is necessary to use text size values and text block values and to consider the distance between different structural elements.

PDF document structure
A PDF document is a binary file that contains all of its information in a single file. This is the key feature that distinguishes it from DOCX and ODT files, where the information is split across several files containing an XML-like scheme [22].
Any PDF document consists of several components, displayed in Fig. 1 [4]. The header contains metadata on the PDF specification version of a particular document. The body comprises all text and graphic data of the document. The xref table is a key distinctive feature of this format: it holds data on all objects of every page or figure and their positions in the document.
All text and graphic data of the document are kept in its body by means of objects. Each object has its own ID. For example, there is a special object that describes the data of a certain page in the document; it refers to another object that describes the content of this page, and so on. Special operators describe the data in objects; some of them are listed in Table 1. It should be noted that font data is always kept in a separate object. Text content in a PDF document is represented as symbols or symbol groups, along with their position on the page.
Thus, it can be concluded that a PDF document contains data on symbols, their locations, fonts, point sizes and rendering type.
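For illustration, the fragment below is a hand-written example of a decompressed page content stream (not taken from a real document): the BT/ET operators delimit a text object, Tf selects a font object and point size, Td positions the text, and Tj shows a string of symbols. The accompanying sketch, assuming the stream has already been decompressed, pulls the shown strings out with a simplified regular expression (it ignores escaped parentheses and the TJ array form).

import re

# A hand-written example of a decompressed page content stream.
content_stream = b"""
BT
/F1 12 Tf          % select font object F1 at 12 pt
72 700 Td          % move the text position to x=72, y=700
(Hello, PDF!) Tj   % show the string
ET
"""

# Extract every string shown with the Tj operator (simplified).
shown = re.findall(rb"\((.*?)\)\s*Tj", content_stream)
print([s.decode("latin-1") for s in shown])   # ['Hello, PDF!']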

Issues associated with paragraph extraction from a PDF document
The analysis of the PDF document structure has revealed some technical issues that may impede the extraction of text paragraphs.
First of all, PDF stores text data symbol by symbol, in contrast to Office Open XML (DOCX) or OpenDocument Text (ODT), where each element is represented by a set of styles, mainly at the level of a whole paragraph. Thus, DOCX and ODT documents provide an opportunity to extract a whole paragraph along with the styles that belong to it, while for PDF this opportunity does not exist. Here lies the main difficulty associated with extracting paragraphs and their attributes from such documents.
Secondly, text in tables is stored identically to the rest of the document text: the table itself is a set of graphic objects (rectangles), while its text is kept as usual. Thus, a table in PDF is not a unified object.
Thirdly, if the formatting rules for reporting documents are not complied with, it is difficult to identify the beginning and end of paragraphs in the text.
Since PDF content is mainly represented by symbols and vector graphic data, the issues mentioned above should be resolved from the point of view of human perception. In other words, the method of differentiating structural elements in a PDF document is based on the following similarity: while reading a report, a human can identify separate paragraphs and other structural elements of the document. Proceeding from this observation, the authors have developed algorithms for identifying text lines and paragraphs, which are described below.

PDF document parsing review
PDF document parsing consists of several stages.
The first stage is to extract all information from the document, including the list of pages, the symbols with their positions on the page, point sizes, fonts, figures, various graphic objects, etc.
The second stage is divided into sub-stages performed in parallel: uniting symbols into lines by their positions; extracting graphic objects; and extracting tables, together with the text inside them, from all pages of the document. The third stage is to delete from the line list those text lines that repeat the table content. The fourth stage is to calculate the interline interval for all lines; if two lines are on different pages, the interval is deemed to be 0.
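A minimal sketch of the interline-interval calculation (the fourth stage), assuming each line is described by its page number and the y-coordinate of its baseline, might look like this.

def compute_line_spacing(lines):
    """Add the interline interval to each line.

    `lines` is a list of dicts with keys 'page' and 'y' (baseline coordinate,
    bottom-left origin); the interval to the previous line is stored under
    'spacing' and is set to 0 when the two lines are on different pages.
    """
    prev = None
    for line in lines:
        if prev is None or line["page"] != prev["page"]:
            line["spacing"] = 0                      # first line or page break
        else:
            line["spacing"] = prev["y"] - line["y"]  # y decreases down the page
        prev = line
    return lines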
At the previous stages the system has gathered all the data and made all the calculations needed for detecting text paragraphs from the list of lines. The last stage is to detect text paragraphs based on this information.

Algorithms for building a list of lines from symbols
The line-building algorithm relies on information about the position of each symbol on a page. A loop processes all symbols on the page and checks whether the y-coordinate of the current symbol coincides with that of the previous one. If the positions are the same, the symbol is added to the current line; if they differ, the line is added to the list and the current symbol starts a new line.
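A minimal sketch of this grouping step is given below; it assumes each symbol is a tuple (char, x, y) already arranged in content order, as produced, for example, by the extraction sketch above.

def build_lines(symbols, tolerance=0.5):
    """Group symbols into lines by comparing their y-coordinates.

    `symbols` is an iterable of (char, x, y) tuples in content order;
    `tolerance` absorbs tiny vertical jitter between glyphs.
    """
    lines, current, current_y = [], [], None
    for char, x, y in symbols:
        if current_y is None or abs(y - current_y) <= tolerance:
            current.append((char, x, y))        # same line: keep accumulating
        else:
            lines.append(current)               # y changed: close the line
            current = [(char, x, y)]
        current_y = y
    if current:
        lines.append(current)
    return lines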
The coordinate grid of a PDF document originates at the bottom-left corner of the page. Thus, the smaller the coordinate values, the lower (and the further to the left) the object is positioned on the page.
The full line-building algorithm is shown in the block diagram in Fig. 2. Reporting documents may contain page numeration and headers/footers. Such elements always come first at the top of the page in the content sequence, and they may interfere with the calculations essential for proper system functioning. To prevent this, an offset of 40 typographic points from the top edge is used along the y-axis. It should also be verified whether the line contains numbers only; if this condition is met, the line is skipped (see the sketch below).
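A sketch of this filtering rule, under the assumptions above (an offset of 40 typographic points from the top edge, a standard A4 page height of 842 points, and lines represented as in the previous sketch), might look as follows.

A4_HEIGHT = 842          # A4 page height in typographic points (assumption)
TOP_MARGIN = 40          # offset from the top edge, as described above

def is_page_number_line(line, page_height=A4_HEIGHT):
    """Return True for lines that look like page numeration in the header area.

    `line` is a list of (char, x, y) tuples as produced by build_lines().
    """
    text = "".join(char for char, _, _ in line).strip()
    top_y = max(y for _, _, y in line)
    near_top = top_y >= page_height - TOP_MARGIN   # within 40 pt of the top edge
    return near_top and text.isdigit()             # digits only => skip the line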

The algorithm for creating text paragraphs
As already mentioned, the algorithm for detecting text paragraphs mimics the way a human perceives paragraphs in a text document. Our team has identified several factors that determine paragraph boundaries: differences in the interline interval between paragraphs; differences in indentation, since the first line of a new paragraph is known to begin with an indent; and the fact that the last line of a paragraph usually does not reach the right edge of the page. Based on these factors, our team has devised the algorithm shown in Fig. 3. The study has determined the best threshold value for how far the last line of a paragraph extends towards the right edge of an A4 page: it amounts to 520 typographic points and can be adjusted depending on the page margins [23].
Considering that some lines may contain capital letters, the acceptable tolerance for differences in the interline interval was set to 2 points.
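A simplified sketch of these rules is shown below. It assumes each line is represented by a dict with its starting x-coordinate ('x_start'), its ending x-coordinate ('x_end') and the interline interval to the previous line ('spacing', 0 across a page break); the constants repeat the values named above (520 points for the right-edge threshold, 2 points of tolerance for the interval), while the indentation threshold is an illustrative assumption rather than a value from the study.

RIGHT_EDGE = 520        # last-line threshold from the text above (A4 page)
INTERVAL_TOLERANCE = 2  # acceptable interline-interval difference, in points
INDENT_THRESHOLD = 15   # first-line indentation threshold (illustrative assumption)

def split_into_paragraphs(lines):
    """Group lines into paragraphs using the three cues described above."""
    paragraphs, current = [], []
    for prev, line in zip([None] + lines[:-1], lines):
        starts_new = (
            prev is not None and (
                line["x_start"] - prev["x_start"] >= INDENT_THRESHOLD   # indented first line
                or abs(line["spacing"] - prev["spacing"]) > INTERVAL_TOLERANCE  # spacing jump
                or prev["x_end"] < RIGHT_EDGE                           # short previous line
            )
        )
        if starts_new and current:
            paragraphs.append(current)
            current = []
        current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs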

Algorithms for creating attributes for lines and paragraphs
To perform the document adequacy check, values must be assigned to paragraph properties and attributes. We have developed the algorithm shown in Fig. 5 to detect them. The algorithm works as follows. When symbols are combined into lines, the coordinates of the first and the last symbol of the line are recorded; based on these data the algorithm reproduces the position of the whole line. Moving from symbol to symbol, the system looks up the font and point size of the current symbol in the corresponding lists to identify whether they have changed; if a font or point size is not yet in the list, it is added. If, after all symbols of the line have been analysed, a list contains more than one item, the line receives an attribute marking its font or point size as heterogeneous.
The procedure for creating paragraph attributes is largely the same, with the only exception that the search is performed over the list of lines. Along with the font, point size and position, the indentation value of the paragraph in the text is recorded.
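A minimal sketch of collecting these attributes for a single line is given below; the record layout is an assumption for illustration, not the authors' exact data structure.

from dataclasses import dataclass, field

@dataclass
class LineAttributes:
    """Attributes collected while scanning the symbols of one line."""
    x_start: float = 0.0
    x_end: float = 0.0
    fonts: list = field(default_factory=list)
    sizes: list = field(default_factory=list)
    mixed_fonts: bool = False    # heterogeneity flags
    mixed_sizes: bool = False

def line_attributes(symbols):
    """`symbols` is a list of (char, x, y, font, size) tuples of one line."""
    attrs = LineAttributes(x_start=symbols[0][1], x_end=symbols[-1][1])
    for _, _, _, font, size in symbols:
        if font not in attrs.fonts:
            attrs.fonts.append(font)      # record every font seen in the line
        if size not in attrs.sizes:
            attrs.sizes.append(size)      # record every point size seen
    attrs.mixed_fonts = len(attrs.fonts) > 1
    attrs.mixed_sizes = len(attrs.sizes) > 1
    return attrs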

Discussion
The algorithm was unable to correctly process figure captions consisting of several lines because of their centre alignment; this accounted for approximately 7.4% of all errors. Errors were also found at the transition from a heading to a normal paragraph, because the line positions matched. Similar errors occurred at the transition from a figure caption to a regular paragraph, again because the line positions matched.

Conclusions
In this article, algorithms have been developed for combining the symbols of a PDF document into lines and for detecting paragraphs. When paragraphs are detected, they receive their own properties and attributes, inherited both from symbols and from line elements. However, in contrast to tagged text document formats, the number of extracted properties is relatively small: the number of symbols, special symbols and words; font and point size; italics; bold type; interline interval (spacing); and indentation.
These algorithms can further be used to automate the accuracy check of such PDF documents as reports on graduation theses, research works, theses, etc.
The aims of further work are:
─ to resolve the issues identified during testing;
─ to improve the accuracy of paragraph detection, possibly employing machine learning algorithms;
─ to develop a formula extraction algorithm;
─ to modernise the algorithm for detecting lists in a document for different design cases.