A Scalable Hybrid Approach for Extracting Head Components from Web Tables

  • Authors:
  • Sung-Won Jung;Hyuk-Chul Kwon

  • Affiliations:
  • IEEE;IEEE

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

We have established a preprocessing method for determining the meaningfulness of a table to allow for information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet for both knowledge structuring and document design. Therefore, we were interested in determining whether or not a table has meaningfulness that is related to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguished meaningful tables from others, 3) constructed a training data set using the established features after having filtered any obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for table head extraction from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.