Consistent Partition and Labelling of Text Blocks

Liang, J.; Phillips, I. T.; Haralick, R. M.

doi:10.1007/s100440070023

Published: June 2000

Volume 3, pages 196–208, (2000)

Pattern Analysis & Applications

J. Liang¹,
I. T. Phillips² &
R. M. Haralick³

Abstract:

This paper presents a text block extraction algorithm that takes as its input a set of text lines of a given document, and partitions the text lines into a set of text blocks, where each text block is associated with a set of homogeneous formatting attributes, e.g. text-alignment, indentation. The text block extraction algorithm described in this paper is probability based. We adopt an engineering approach to systematically characterising the text block structures based on a large document image database, and develop statistical methods to extract the text block structures from the image. All the probabilities are estimated from an extensive training set of various kinds of measurements among the text lines, and among the text blocks in the training data set. The off-line probabilities estimated in the training then drive all decisions in the on-line text block extraction. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text block extraction algorithm, we used a three-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. The text block extraction algorithm identifies and segments 91% of text blocks correctly.
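The iterative, relaxation-like search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the line representation (left edge, gap above), the Gaussian parameters, the break prior, and the function names `joint_log_prob` and `partition` are all assumptions made for illustration, whereas the paper estimates its probabilities from training data. The sketch toggles one block boundary at a time and keeps a change only when the joint log-probability of the whole partition increases.

```python
import math

def log_gauss(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def to_blocks(n_lines, breaks):
    """Turn boundary flags into lists of line indices, one list per block."""
    blocks, current = [], [0]
    for i, is_break in enumerate(breaks):
        if is_break:
            blocks.append(current)
            current = []
        current.append(i + 1)
    blocks.append(current)
    return blocks

def joint_log_prob(lines, breaks):
    """Toy joint log-probability of a partition.

    lines[i] = (left_edge, gap_above). All numeric parameters below are
    invented for this sketch; they are not values from the paper.
    """
    total = 0.0
    # Boundary term: large vertical gaps are more likely at block breaks.
    for i, is_break in enumerate(breaks):
        gap = lines[i + 1][1]
        if is_break:
            total += math.log(0.2) + log_gauss(gap, 24.0, 6.0)
        else:
            total += math.log(0.8) + log_gauss(gap, 12.0, 3.0)
    # Homogeneity term: lines in one block should share a left edge.
    for block in to_blocks(len(lines), breaks):
        lefts = [lines[i][0] for i in block]
        mean_left = sum(lefts) / len(lefts)
        total += sum(log_gauss(x, mean_left, 4.0) for x in lefts)
    return total

def partition(lines, max_sweeps=50):
    """Coordinate ascent over boundary flags; expects a non-empty list of lines."""
    breaks = [False] * (len(lines) - 1)
    best = joint_log_prob(lines, breaks)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(breaks)):
            breaks[i] = not breaks[i]          # tentatively flip one boundary
            score = joint_log_prob(lines, breaks)
            if score > best + 1e-12:
                best, changed = score, True    # keep the improvement
            else:
                breaks[i] = not breaks[i]      # undo the flip
        if not changed:
            break                              # converged to a local maximum
    return to_blocks(len(lines), breaks)

# Six hypothetical text lines: three at left edge 10, three at left edge 40.
LINES = [(10, 0), (10, 12), (10, 12), (40, 26), (40, 12), (40, 11)]
print(partition(LINES))
```

On these six hypothetical lines the search settles on two blocks, `[[0, 1, 2], [3, 4, 5]]`. Like any local search of this kind, it can stop at a local maximum, so the result depends on the starting partition and the sweep order.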

Author information

Authors and Affiliations

MathSoft, Inc., Seattle, WA, USA
J. Liang
Department of Computer Science/Software Engineering, Seattle University, Seattle, WA, USA
I. T. Phillips
Department of Electrical Engineering, University of Washington, Seattle, WA, USA
R. M. Haralick

About this article

Cite this article

Liang, J., Phillips, I. & Haralick, R. Consistent Partition and Labelling of Text Blocks. Pattern Analysis & Applications 3, 196–208 (2000). https://doi.org/10.1007/s100440070023

Issue Date: June 2000
DOI: https://doi.org/10.1007/s100440070023

Keywords: Document structure; Hidden Markov Model; Layout analysis; Statistical-based; Text block extraction; UW-III database
