A Scalable Hybrid Approach for Extracting Head Components from Web Tables

Authors:
Sung-Won Jung;Hyuk-Chul Kwon
Affiliations:
IEEE;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2006

Citing 4

Data mining: practical machine learning tools and techniques with Java implementations

A machine learning based approach for table detection on the web
Proceedings of the 11th international conference on World Wide Web

Effective Retrieval of Information in Tables on the Internet
IEA/AIE '02 Proceedings of the 15th international conference on Industrial and engineering applications of artificial intelligence and expert systems: developments in applied artificial intelligence

Mining tables from large scale HTML texts
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Cited 6

Extracting logical structures from HTML tables
Computer Standards & Interfaces

Discriminating Meaningful Web Tables from Decorative Tables Using a Composite Kernel
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01

Constructing domain ontology using structural and semantic characteristics of web-table head
IEA/AIE'07 Proceedings of the 20th international conference on Industrial, engineering, and other applications of applied intelligent systems

Hybrid approach to extracting information from web-tables
ICCPOL'06 Proceedings of the 21st international conference on Computer Processing of Oriental Languages: beyond the orient: the research challenges ahead

A machine learning based approach for separating head from body in web-tables
CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing

Web table discrimination with composition of rich structural and content information
Applied Soft Computing

Abstract

We present a preprocessing method that determines whether a table is meaningful, so that information can be extracted from tables on the Internet. A table is a valuable clue in text mining because it presents meaningful data in rows and columns. On the Internet, however, tables are used both to structure knowledge and to lay out documents. We were therefore interested in determining whether a table is meaningful, where meaningfulness is tied to the structural information that the table head provides at the abstraction level. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from the others, 3) constructed a training data set using these features after filtering out obviously decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we also set up heuristics for extracting the table head from meaningful tables. We obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
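
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the actual features or filtering rules, so the feature names (`rows`, `cells`, `has_th`, `numeric_ratio`), the pre-filter in `is_obviously_decorative`, and the sample HTML are all hypothetical. The decision-tree training step is omitted, and nested tables are not handled (their counts are merged). Only the F-measure formula, the harmonic mean of precision and recall, is standard.

```python
from html.parser import HTMLParser


class TableFeatures(HTMLParser):
    """Collects simple structural counts from an HTML fragment (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = 0           # number of <tr> start tags
        self.th_cells = 0       # number of <th> cells
        self.td_cells = 0       # number of <td> cells
        self.numeric_cells = 0  # cells whose text is accepted by float()
        self._in_cell = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag in ("th", "td"):
            if tag == "th":
                self.th_cells += 1
            else:
                self.td_cells += 1
            self._in_cell = True
            self._text = []

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._in_cell:
            self._in_cell = False
            # Drop thousands separators before testing for a number.
            text = "".join(self._text).strip().replace(",", "")
            try:
                float(text)
                self.numeric_cells += 1
            except ValueError:
                pass


def extract_features(html):
    """Return a dict of hypothetical features for the table markup in `html`."""
    parser = TableFeatures()
    parser.feed(html)
    parser.close()
    cells = parser.th_cells + parser.td_cells
    return {
        "rows": parser.rows,
        "cells": cells,
        "has_th": parser.th_cells > 0,
        "numeric_ratio": parser.numeric_cells / cells if cells else 0.0,
    }


def is_obviously_decorative(features):
    """Assumed pre-filter: too few rows or cells to hold both a head and a body."""
    return features["rows"] < 2 or features["cells"] < 2


def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample inputs: one data table and one layout table.
data_table = (
    "<table><tr><th>Year</th><th>Sales</th></tr>"
    "<tr><td>2005</td><td>1,200</td></tr></table>"
)
layout_table = "<table><tr><td><img src='logo.png'></td></tr></table>"

for html in (data_table, layout_table):
    feats = extract_features(html)
    print(feats, "decorative:", is_obviously_decorative(feats))
```

In a fuller pipeline, the feature dictionaries of tables that survive the pre-filter would be labeled by hand and passed to a decision-tree learner, and `f_measure` would be computed from the classifier's confusion counts on held-out tables.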