CuTeX: a system for extracting data from text tables

Authors:
Hasan Davulcu;Saikat Mukherjee;Arvind Seth;I. V. Ramakrishnan
Affiliations:
SUNY Stony Brook, Stony Brook, NY;SUNY Stony Brook, Stony Brook, NY;SUNY Stony Brook, Stony Brook, NY;SUNY Stony Brook, Stony Brook, NY
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

References:
TINTIN: a system for retrieval in text tables. DL '97 Proceedings of the second ACM international conference on Digital libraries
NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents. SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Abstract

A wealth of information relevant for e-commerce often appears in text form. This includes specification and performance data sheets of products, financial statements, product offerings, etc. Typically these types of product and financial data are published in tabular form. The only separators between items in the table are white spaces and line separators. We will refer to such tables as text tables. Due to the lack of structure in such tables, the information present is not readily queryable using traditional database query languages like SQL. One way to make it amenable to standard database querying techniques is to extract the data items in the tables and create a database out of the extracted data. But extraction from text tables poses difficulties due to the irregularity of the data in the columns.

Existing techniques like [1] and [3] are based on finding fixed separators between successive columns. However, it is not always possible to find fixed separators. Even if fixed separators exist, they may not unambiguously separate columns that have multiword items. Another set of techniques is based on regular expressions. The problems here are: (i) they are difficult to construct and (ii) they depend on lexical similarity between column items.

Note that, by visual inspection, a casual observer can correctly associate every item in a text table with its corresponding column. This is because all the items belonging to a column appear "clustered" more closely to each other than to items in different columns. Whereas such clusters can be clearly discerned by a human observer, making them machine recognizable is the key to robust automated extraction of data items from text-based tables. Clustering enables us to make associations between items in a column based not merely on examining items in adjacent rows but across all the rows in the table.

We have designed and implemented the CuteX system for extracting data from irregular text tables. The input is a file containing only text tables.
The output produced by CuteX is an association between every item and its column. Note that CuteX does not do table detection in text. The innovative aspect of CuteX is its clustering-based algorithm that drives the extraction process. In CuteX each line is broken down into a set of tokens. Each token is a contiguous sequence of non-white-space characters. The center of any token in a cluster is closer to the center of some other token in the same cluster than to the center of any token in a different cluster. Inter-cluster gaps are gaps between the extremal tokens in the clusters. Starting with an initial set of clusters, adjacent clusters are merged into bigger clusters based on the inter-cluster gaps. The algorithm terminates when no more clusters can be merged. We have formalized the notion of a correct extraction and developed a syntactic characterization of tables on which this algorithm will always produce a correct extraction. Details appear in [2]. A unique aspect of the algorithm is its robustness in the presence of misalignments.

Precision of extraction can be improved by supplying the minimum separation between columns as a parameter. Such a separator is estimated by sampling a few input tables. The clustering algorithm does not merge adjacent clusters if the gap between them is larger than this parameter value. Note though that the minimum column gap cannot be used as a fixed separator, since doing so amounts to a purely local determination of column boundaries, which is brittle in the presence of misalignments.

CuteX is implemented in Java and is approximately 3000 lines of code. The system automatically partitions the set of input text tables into directories containing correct and incorrect extractions. At the end of an extraction, the user can examine the directory containing incorrectly extracted tables, sample a few of them, determine whether the errors were caused by an erroneous estimate of the minimum column gap, re-adjust the configuration parameter and start a new extraction on all these tables.
Successive iterations can generate a higher extraction yield.

The primary focus of the demonstration will be on illustrating the robustness and the iterative process of improving the extraction yield of the clustering algorithm.
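
The clustering idea described in the abstract can be illustrated with a small sketch. This is not the CuteX algorithm itself, whose details are in [2]; it is a simplified stand-in, written in Python rather than Java for brevity. It assumes that tokens from all rows are projected onto the horizontal axis, that every token starts as its own cluster, and that two adjacent clusters are merged when the gap between their extremal edges is smaller than a minimum column gap. Treating a gap exactly equal to the parameter as a column boundary is also an assumption. The names Token, tokenize, cluster_columns, extract and min_gap are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    row: int
    start: int  # index of the first character
    end: int    # index one past the last character


def tokenize(lines):
    """Split each line into maximal runs of non-whitespace characters."""
    return [Token(m.group(), row, m.start(), m.end())
            for row, line in enumerate(lines)
            for m in re.finditer(r"\S+", line)]


def cluster_columns(tokens, min_gap):
    """Merge adjacent clusters until every remaining gap is at least min_gap."""
    if not tokens:
        return []
    # Initial clusters: one per token, ordered left to right.
    clusters = [[t] for t in sorted(tokens, key=lambda t: (t.start, t.end))]
    merged = True
    while merged:  # stop when a full sweep performs no merge
        merged = False
        result = [clusters[0]]
        for cluster in clusters[1:]:
            left = result[-1]
            # Inter-cluster gap between extremal edges (negative on overlap).
            gap = min(t.start for t in cluster) - max(t.end for t in left)
            if gap < min_gap:
                left.extend(cluster)
                merged = True
            else:
                result.append(cluster)
        clusters = result
    return clusters


def extract(lines, min_gap=2):
    """Return one list of cell strings per input line, one cell per cluster."""
    clusters = cluster_columns(tokenize(lines), min_gap)
    table = []
    for row in range(len(lines)):
        cells = []
        for cluster in clusters:
            words = [t.text for t in sorted(cluster, key=lambda t: t.start)
                     if t.row == row]
            cells.append(" ".join(words))
        table.append(cells)
    return table


if __name__ == "__main__":
    sample = [
        "Laser printer    120  USD",
        "Ink jet           85  USD",
        "Toner cartridge   40  USD",
    ]
    for cells in extract(sample, min_gap=2):
        print(cells)
```

Because clusters are built from all rows at once, multiword items such as "Toner cartridge" stay in one column even though a single space separates the words. Raising min_gap above the true column separation makes neighbouring columns collapse into one, which is the kind of incorrect extraction the iterative re-adjustment of the parameter is meant to repair.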