ABSTRACT
A wealth of information relevant to e-commerce appears in text form, including product specification and performance data sheets, financial statements, and product offerings. Such product and financial data are typically published in tabular form, where the only separators between items are white space and line breaks. We refer to such tables as text tables. Because these tables lack structure, the information in them cannot readily be queried with traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items from the tables and build a database from the extracted data. However, extraction from text tables is difficult because the data within a column is irregular.
Existing techniques such as [1] and [3] rely on finding fixed separators between successive columns. However, fixed separators cannot always be found, and even when they exist they may not unambiguously separate columns that contain multiword items. Another set of techniques is based on regular expressions. These have two problems: (i) the expressions are difficult to construct and (ii) they depend on lexical similarity between column items.
By visual inspection, a casual observer can correctly associate every item in a text table with its corresponding column. This is because the items belonging to a column appear "clustered" more closely to each other than to items in different columns. Although a human observer can clearly discern such clusters, making them machine recognizable is the key to robust automated extraction of data items from text tables. Clustering lets us associate items in a column by examining not merely the items in adjacent rows but all the rows in the table.
We have designed and implemented the CuteX system for extracting data from irregular text tables. The input is a file containing only text tables; CuteX does not detect tables in running text. The output is an association among the items in each column. The innovative aspect of CuteX is the clustering-based algorithm that drives the extraction process. Each line is broken down into a set of tokens, where a token is a contiguous sequence of non-white-space characters. Within a cluster, the center of any token is closer to the center of some other token in the same cluster than to the center of any token in a different cluster. Inter-cluster gaps are the gaps between the extremal tokens of adjacent clusters. Starting with an initial set of clusters, adjacent clusters are merged into bigger clusters based on the inter-cluster gaps. The algorithm terminates when no more clusters can be merged. We have formalized the notion of a correct extraction and developed a syntactic characterization of the tables on which this algorithm always produces a correct extraction. Details appear in [2]. A unique aspect of the algorithm is its robustness in the presence of misalignments.
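The token and gap definitions above can be illustrated with a minimal Python sketch. It is not the CuteX implementation (which is written in Java and specified in [2]). The names `Token`, `tokenize`, and `inter_cluster_gap` are hypothetical, and measuring the gap in character columns between the rightmost extent of one cluster and the leftmost extent of the next is an assumption about how "extremal tokens" are compared.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A contiguous run of non-white-space characters (hypothetical helper type)."""
    text: str
    row: int
    start: int  # column of the first character
    end: int    # column of the last character, inclusive

    @property
    def center(self) -> float:
        return (self.start + self.end) / 2


def tokenize(lines):
    """Break every line into tokens, recording each token's horizontal extent."""
    tokens = []
    for row, line in enumerate(lines):
        for match in re.finditer(r"\S+", line):
            tokens.append(Token(match.group(), row, match.start(), match.end() - 1))
    return tokens


def inter_cluster_gap(left, right):
    """Blank columns between the extremal tokens of two adjacent clusters.

    Assumption: `left` lies entirely to the left of `right`.
    """
    return min(t.start for t in right) - max(t.end for t in left) - 1


table = [
    "Widget A   10.50",
    "Gadget     7.25",
]
tokens = tokenize(table)

# Hand-picked clusters for illustration: item names versus prices.
names = [t for t in tokens if t.end <= 7]
prices = [t for t in tokens if t.start >= 11]
print(inter_cluster_gap(names, prices))  # 3
```

The price tokens `10.50` and `7.25` have centers 13.0 and 12.5, so they sit close together even though their right edges are misaligned, which is the kind of evidence a clustering approach can use across rows.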
Precision of extraction can be improved by supplying the minimum separation between columns as a parameter. This value is estimated by sampling a few input tables. The clustering algorithm does not merge adjacent clusters if the gap between them is larger than this parameter value. Note, however, that the minimum column gap cannot be used as a fixed separator, since doing so amounts to a localized determination of column boundaries, which is brittle in the presence of misalignments.
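The role of the minimum column gap can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of [2]: it merges adjacent clusters purely on a gap threshold, whereas the actual merge criteria are given in that paper. The function name `merge_clusters`, the representation of a cluster as an inclusive `(start, end)` column extent, and the choice to treat a gap exactly equal to the parameter as a column boundary are all assumptions.

```python
def merge_clusters(spans, min_column_gap):
    """Merge adjacent clusters until no further merge is possible.

    `spans` holds the (start, end) column extents of all tokens in the table,
    pooled across every row; each token starts as its own cluster.
    Assumption: clusters are merged when the gap between them is strictly
    smaller than `min_column_gap`.
    """
    if not spans:
        return []
    clusters = sorted(set(spans))
    merged = True
    while merged:  # terminate when a full pass performs no merge
        merged = False
        result = [clusters[0]]
        for start, end in clusters[1:]:
            prev_start, prev_end = result[-1]
            gap = start - prev_end - 1  # negative when the extents overlap
            if gap < min_column_gap:
                result[-1] = (min(prev_start, start), max(prev_end, end))
                merged = True
            else:
                result.append((start, end))
        clusters = result
    return clusters


# Token extents for the two rows "Widget A   10.50" and "Gadget     7.25".
spans = [(0, 5), (7, 7), (11, 15), (0, 5), (11, 14)]

print(merge_clusters(spans, 3))  # [(0, 7), (11, 15)]
print(merge_clusters(spans, 1))  # [(0, 5), (7, 7), (11, 15)]
print(merge_clusters(spans, 5))  # [(0, 15)]
```

In this toy input, an estimate of 3 recovers two columns and keeps the multiword item `Widget A` together, an estimate of 1 splits that item into two columns, and an estimate of 5 collapses the whole table into a single column. Because extents are pooled across all rows before merging, the decision is not made one line at a time.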
CuteX is implemented in Java and comprises approximately 3000 lines of code. The system automatically partitions the set of input text tables into directories containing correct and incorrect extractions. At the end of an extraction, the user can examine the directory of incorrectly extracted tables, sample a few of them, determine whether the failures were caused by an erroneous estimate of the minimum column gap, readjust the configuration parameter, and start a new extraction on all of these tables. Successive iterations can generate a higher extraction yield.
The primary focus of the demonstration will be on illustrating the robustness and the iterative process of improving the extraction yield of the clustering algorithm.
[1] Brad Adelberg. NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. In ACM SIGMOD, 1998.
[2] Hasan Davulcu, Saikat Mukherjee, and I.V. Ramakrishnan. A Clustering Technique for Mining Data from Text Tables. In SIAM ICDM, 2002.
[3] Pallavi Pyreddy and W. Bruce Croft. TINTIN: A System for Retrieval in Text Tables. In ACM DL, 1997.