Article

CuTeX: a system for extracting data from text tables

Authors:
Hasan Davulcu, SUNY Stony Brook, Stony Brook, NY
Saikat Mukherjee, SUNY Stony Brook, Stony Brook, NY
Arvind Seth, SUNY Stony Brook, Stony Brook, NY
I. V. Ramakrishnan, SUNY Stony Brook, Stony Brook, NY

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, August 2002, Page 457. https://doi.org/10.1145/564376.564498

Published: 11 August 2002

ABSTRACT

A wealth of information relevant for e-commerce often appears in text form. This includes specification and performance data sheets of products, financial statements, product offerings, etc. Typically, these types of product and financial data are published in tabular form, where the only separators between items are white spaces and line separators. We will refer to such tables as text tables. Due to the lack of structure in such tables, the information present is not readily queryable using traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items in the tables and create a database out of the extracted data. But extraction from text tables poses difficulties due to the irregularity of the data in the columns.

Existing techniques like [1] and [3] are based on finding fixed separators between successive columns. However, it is not always possible to find fixed separators. Even if fixed separators exist, they may not unambiguously separate columns that have multiword items. Another set of techniques is based on regular expressions. The problems here are that (i) the expressions are difficult to construct and (ii) they depend on lexical similarity between column items.
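The difficulty with separators can be seen on a tiny example. The following is a minimal sketch, not taken from [1] or [3]: the sample rows, the two-or-more-spaces separator, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical text table: a multiword item column followed by a price column.
rows = [
    "Pentium III Laptop   1299",
    "Desktop PC Basic      799",
    "Flat Panel Monitor 1499",   # misaligned: a single space before the price
]

def split_fixed(line, sep=r" {2,}"):
    """Split a line wherever the assumed fixed separator (2+ spaces) occurs."""
    return re.split(sep, line.strip())

def split_whitespace(line):
    """Split a line on every run of white space."""
    return line.split()

for row in rows:
    print(split_fixed(row), split_whitespace(row))
```

Splitting on every white space breaks the multiword items apart, while the fixed separator handles the first two rows but returns the misaligned third row as a single field.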

Note that, by visual inspection, a casual observer can correctly associate every item in a text table with its corresponding column. This is because all the items belonging to a column appear "clustered" more closely to each other than to items in different columns. Whereas such clusters can be clearly discerned by a human observer, making them machine recognizable is the key to robust automated extraction of data items from text-based tables. Clustering enables us to make associations between items in a column based not merely on examining items in adjacent rows but across all the rows in the table.

We have designed and implemented the CuTeX system for extracting data from irregular text tables. The input is a file containing only text tables. The output produced by CuTeX is an association among the items in each column. Note that CuTeX does not do table detection in text. The innovative aspect of CuTeX is its clustering-based algorithm that drives the extraction process. In CuTeX each line is broken down into a set of tokens. Each token is a contiguous sequence of non-white-space characters. The center of any token in a cluster is closer to the center of some other token in the same cluster than to the center of any token in a different cluster. Inter-cluster gaps are gaps between the extremal tokens in the clusters. Starting with an initial set of clusters, adjacent clusters are merged into bigger clusters based on the inter-cluster gaps. The algorithm terminates when no more clusters can be merged. We have formalized the notion of a correct extraction and developed a syntactic characterization of tables on which this algorithm will always produce a correct extraction. Details appear in [2]. A unique aspect of the algorithm is its robustness in the presence of misalignments.
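The tokenize-then-merge idea can be sketched as follows. This is a simplified illustration and not the algorithm of [2]: it works on the horizontal extents of tokens rather than their centers, and the names Cluster, tokenize, merge_clusters, column_items and the merge_gap threshold are assumptions introduced here.

```python
import re
from dataclasses import dataclass

@dataclass
class Cluster:
    left: int      # leftmost character offset covered by the cluster
    right: int     # one past the rightmost character offset
    tokens: list   # (row index, token text) pairs

def tokenize(lines):
    """Create one single-token cluster per run of non-white-space characters."""
    return [Cluster(m.start(), m.end(), [(row, m.group())])
            for row, line in enumerate(lines)
            for m in re.finditer(r"\S+", line)]

def merge_clusters(clusters, merge_gap):
    """Merge horizontally adjacent clusters until no gap is below merge_gap."""
    clusters = sorted(clusters, key=lambda c: c.left)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters) - 1):
            a, b = clusters[i], clusters[i + 1]
            # Gap between the extremal tokens; negative if the clusters overlap.
            if b.left - a.right < merge_gap:
                clusters[i:i + 2] = [Cluster(a.left, max(a.right, b.right),
                                             a.tokens + b.tokens)]
                merged = True
                break
    return clusters

def column_items(cluster, n_rows):
    """Join the tokens of a cluster row by row, left to right."""
    items = [[] for _ in range(n_rows)]
    for row, text in cluster.tokens:
        items[row].append(text)
    return [" ".join(words) for words in items]

# Hypothetical sample table with a right-aligned (misaligned) second row.
table = [
    "Pentium III Laptop   1299",
    "Desktop PC Basic      799",
    "Flat Panel Monitor   1499",
]
columns = merge_clusters(tokenize(table), merge_gap=2)
for column in columns:
    print(column_items(column, len(table)))
```

On this sample the sketch yields two clusters, one holding the three multiword product names and one holding the three prices, because tokens from all rows contribute to each cluster's extent before gaps are compared.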

Precision of extraction can be improved by supplying the minimum separation between columns as a parameter. Such a separator is estimated by sampling a few input tables. The clustering algorithm does not merge adjacent clusters if the gap between them is larger than this parameter value. Note, though, that the minimum column gap cannot be used as a fixed separator, since doing so amounts to a purely local determination of column boundaries, which is brittle in the presence of misalignments.
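The abstract does not say how the minimum separation is estimated from the samples. One possible heuristic, given purely as an assumption, is to measure the runs of character positions that are blank in every row of a sampled table and take the narrowest such run; the function names below are hypothetical, and the sketch assumes non-empty tables padded with spaces rather than tabs.

```python
def blank_runs(lines):
    """Widths of interior runs of character positions that are blank in every row."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    runs, run, seen_text = [], 0, False
    for is_blank in blank:
        if is_blank:
            run += 1
        else:
            if run and seen_text:   # skip the leading margin
                runs.append(run)
            run = 0
            seen_text = True
    return runs                     # a trailing blank run is never recorded

def estimate_min_column_gap(sample_tables):
    """Smallest all-row blank run over the sampled tables, or None if there is none."""
    widths = [w for table in sample_tables for w in blank_runs(table)]
    return min(widths) if widths else None

# Hypothetical sampled tables.
samples = [
    ["Pentium III Laptop   1299",
     "Desktop PC Basic      799",
     "Flat Panel Monitor   1499"],
    ["ACME Corp    12.5",
     "Widget Co     3.0"],
]
print(estimate_min_column_gap(samples))
```

An estimate obtained this way is only a starting value for the parameter; a table whose columns are misaligned across rows may have no all-row blank run at all, which is one reason the value is used as a merge limit rather than as a fixed separator.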

CuTeX is implemented in Java and is approximately 3000 lines of code. The system automatically partitions the set of input text tables into directories containing correct and incorrect extractions. At the end of an extraction, the user can examine the directory containing incorrectly extracted tables, sample a few of them, determine whether the errors were caused by an erroneous estimate of the minimum column gap, re-adjust the configuration parameter and start a new extraction on all these tables. Successive iterations can generate a higher extraction yield.
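The iterate-and-retry workflow can be outlined as below. This is a sketch under stated assumptions and not the CuTeX implementation (which is in Java and works on directories): tables are kept in memory, and the extractor factory make_extractor and the check is_correct are hypothetical callables supplied by the caller.

```python
def partition(tables, extract, is_correct):
    """Split tables into (correct, incorrect) results using a caller-supplied check."""
    correct, incorrect = {}, {}
    for name, lines in tables.items():
        result = extract(lines)
        (correct if is_correct(result) else incorrect)[name] = result
    return correct, incorrect

def iterate(tables, make_extractor, is_correct, gaps):
    """Re-run only the tables that failed, once for each successive gap setting."""
    done, pending = {}, dict(tables)
    for gap in gaps:
        ok, bad = partition(pending, make_extractor(gap), is_correct)
        done.update(ok)
        pending = {name: pending[name] for name in bad}
        if not pending:
            break
    return done, pending
```

Here gaps stands for the sequence of parameter values the user tries after inspecting failures; in the workflow described above each new value would be chosen by hand rather than fixed in advance.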

The primary focus of the demonstration will be on illustrating the robustness and the iterative process of improving the extraction yield of the clustering algorithm.

References

[1] Brad Adelberg. NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. In ACM SIGMOD, 1998.
[2] Hasan Davulcu, Saikat Mukherjee, and I. V. Ramakrishnan. A Clustering Technique for Mining Data from Text Tables. In SIAM ICDM, 2002.
[3] Pallavi Pyreddy and W. Bruce Croft. TINTIN: A System for Retrieval in Text Tables. In ACM DL, 1997.

Published in
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN:1581135610
DOI:10.1145/564376
General Chair:
Kalervo Järvelin, University of Tampere, Finland
Program Chairs:
Micheline Beaulieu, University of Sheffield, UK
Ricardo Baeza-Yates, University of Chile, Chile
Sung Hyon Myaeng, Chungnam National University, Korea
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGIR '02 Paper Acceptance Rate: 44 of 219 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%