DOI: 10.1145/564376.564498
Article

CuTeX: a system for extracting data from text tables

Published: 11 August 2002

ABSTRACT

A wealth of information relevant to e-commerce appears in text form. This includes product specification and performance data sheets, financial statements, product offerings, etc. Typically these types of product and financial data are published in tabular form, where the only separators between items in the table are white space and line separators. We will refer to such tables as text tables. Due to the lack of structure in such tables, the information they contain is not readily queryable using traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items in the tables and create a database out of the extracted data. But extraction from text tables poses difficulties due to the irregularity of the data in the columns.

Existing techniques like [1] and [3] are based on finding fixed separators between successive columns. However, it is not always possible to find fixed separators. Even if fixed separators exist, they may not unambiguously separate columns that have multiword items. Another set of techniques is based on regular expressions. The problems here are that (i) the expressions are difficult to construct and (ii) they depend on lexical similarity between column items.
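The separator problem can be seen on a small example. The sketch below is a toy illustration of the fixed-separator idea only, not the method of [1] or [3]; the function name `split_on_fixed_separator`, the sample rows, and the two candidate separators are assumptions made for this example.

```python
import re

def split_on_fixed_separator(line, separator=r" {2,}"):
    """Split a table row wherever the fixed separator pattern occurs."""
    return re.split(separator, line.strip())

# A three-column table (name, quantity, price) with a multiword name column.
rows = [
    "Widget A      10   4.50",
    "Gadget       200  12.00",
    "Big Gizmo 7        0.99",  # only one space before the quantity
]

# Separator "two or more spaces": the last row collapses to two fields.
wide = [split_on_fixed_separator(r) for r in rows]

# Separator "one or more spaces": multiword names are split apart.
narrow = [split_on_fixed_separator(r, r" +") for r in rows]

print([len(fields) for fields in wide])
print([len(fields) for fields in narrow])
```

Neither separator yields three fields on every row, which is the ambiguity the paragraph above describes.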

Note that, by visual inspection, a casual observer can correctly associate every item in a text table with its corresponding column. This is because all the items belonging to a column appear "clustered" more closely to each other than to items in different columns. Whereas such clusters can be clearly discerned by a human observer, making them machine-recognizable is the key to robust automated extraction of data items from text-based tables. Clustering enables us to make associations between items in a column based not merely on examining items in adjacent rows but across all the rows in the table.

We have designed and implemented the CuteX system for extracting data from irregular text tables. The input is a file containing only text tables. The output produced by CuteX is an association among all the items in each column. Note that CuteX does not do table detection in text. The innovative aspect of CuteX is the clustering-based algorithm that drives the extraction process. In CuteX each line is broken down into a set of tokens. Each token is a contiguous sequence of non-whitespace characters. The center of any token in a cluster is closer to the center of some other token in the same cluster than to the center of any token in a different cluster. Inter-cluster gaps are the gaps between the extremal tokens in the clusters. Starting with an initial set of clusters, adjacent clusters are merged into bigger clusters based on the inter-cluster gaps. The algorithm terminates when no more clusters can be merged. We have formalized the notion of a correct extraction and developed a syntactic characterization of tables on which this algorithm will always produce a correct extraction. Details appear in [2]. A unique aspect of the algorithm is its robustness in the presence of misalignments.
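The token, center, and inter-cluster gap notions above can be sketched as follows. This is a minimal illustration, not the CuteX implementation (which is written in Java and specified in [2]); the names `Token`, `tokenize`, and `inter_cluster_gap` are hypothetical, and measuring centers and gaps in character positions is an assumption of this sketch.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    row: int    # index of the line the token came from
    text: str   # the contiguous run of non-whitespace characters
    start: int  # position of the first character
    end: int    # position one past the last character

    @property
    def center(self) -> float:
        # Midpoint between the first and last character positions (assumed definition).
        return (self.start + self.end - 1) / 2

def tokenize(lines):
    """Break each line into tokens: maximal runs of non-whitespace characters."""
    return [
        Token(row, m.group(), m.start(), m.end())
        for row, line in enumerate(lines)
        for m in re.finditer(r"\S+", line)
    ]

def inter_cluster_gap(left_cluster, right_cluster):
    """Blank positions between the rightmost extent of the left cluster
    and the leftmost extent of the right cluster (its extremal tokens)."""
    return min(t.start for t in right_cluster) - max(t.end for t in left_cluster)

tokens = tokenize(["Widget A   10", "Gadget     200"])
name_cluster = [t for t in tokens if t.end <= 8]
qty_cluster = [t for t in tokens if t.start >= 11]
print(inter_cluster_gap(name_cluster, qty_cluster))
```

A negative gap under this definition means the two clusters overlap horizontally.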

Precision of extraction can be improved by supplying the minimum separation between columns as a parameter. This value is estimated by sampling a few input tables. The clustering algorithm does not merge adjacent clusters if the gap between them is larger than this parameter value. Note, though, that the minimum column gap cannot be used as a fixed separator, since doing so amounts to a purely local determination of column boundaries, which is brittle in the presence of misalignments.
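The role of the minimum column gap can be sketched with a simplified merge loop. This is an illustration under stated assumptions, not the algorithm of [2]: the function name `column_clusters` is hypothetical, every token starts as its own cluster, gaps are measured in character positions, and a gap exactly equal to the parameter is treated as a column boundary.

```python
import re

def column_clusters(lines, min_col_gap):
    """Merge horizontally adjacent clusters until no merge is possible.
    Each cluster is (left, right, items) with items = [(row, text), ...]."""
    # Assumed initial clustering: one cluster per token.
    clusters = [
        (m.start(), m.end(), [(row, m.group())])
        for row, line in enumerate(lines)
        for m in re.finditer(r"\S+", line)
    ]
    clusters.sort(key=lambda c: (c[0], c[1]))
    merged_any = True
    while merged_any:  # stop once a full pass performs no merge
        merged_any = False
        result = clusters[:1]
        for left, right, items in clusters[1:]:
            cur_left, cur_right, cur_items = result[-1]
            if left - cur_right < min_col_gap:
                # Gap is below the minimum column separation: same column.
                result[-1] = (cur_left, max(cur_right, right), cur_items + items)
                merged_any = True
            else:
                result.append((left, right, items))
        clusters = result
    return clusters

# Misaligned table: the quantity column starts at a different position in each row.
table = [
    "Widget A      10   4.50",
    "Gadget       200  12.00",
    "Big Gizmo      7   0.99",
]
good = column_clusters(table, min_col_gap=2)
too_large = column_clusters(table, min_col_gap=3)
print(len(good), len(too_large))
```

Because extents are accumulated over all rows, the multiword names and the shifted quantities still fall into single columns with `min_col_gap=2`. Overestimating the parameter as 3 merges the quantity and price columns, which is the kind of error that re-estimating the parameter is meant to correct.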

CuteX is implemented in Java and comprises approximately 3000 lines of code. The system automatically partitions the set of input text tables into directories containing correct and incorrect extractions. At the end of an extraction, the user can examine the directory containing incorrectly extracted tables, sample a few of them, determine whether the errors were caused by an erroneous estimate of the minimum column gap, readjust the configuration parameter, and start a new extraction on all these tables. Successive iterations can generate a higher extraction yield.

The primary focus of the demonstration will be on illustrating the robustness and the iterative process of improving the extraction yield of the clustering algorithm.

References

  1. Brad Adelberg. NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. In ACM SIGMOD, 1998.
  2. Hasan Davulcu, Saikat Mukherjee, and I.V. Ramakrishnan. A Clustering Technique for Mining Data from Text Tables. In SIAM ICDM, 2002.
  3. Pallavi Pyreddy and W. Bruce Croft. TINTIN: A System for Retrieval in Text Tables. In ACM DL, 1997.

  • Published in

    SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
    August 2002
    478 pages
    ISBN:1581135610
    DOI:10.1145/564376

    Copyright © 2002 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Article

    Acceptance Rates

    SIGIR '02 Paper Acceptance Rate: 44 of 219 submissions, 20%. Overall Acceptance Rate: 792 of 3,983 submissions, 20%.