Discriminating Meaningful Web Tables from Decorative Tables Using a Composite Kernel

Authors:
Jeong-Woo Son; Jae-An Lee; Seong-Bae Park; Hyun-Je Song; Sang-Jo Lee; Se-Young Park
Venue:
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2008

Abstract

Information extraction from the World Wide Web has received considerable attention. Because a table is a well-organized, condensed expression of knowledge about a domain, extracting information from tables is of great importance. However, many tables in web pages are used not to convey information but to decorate the page. Discriminating meaningful tables from decorative ones is therefore one of the most critical tasks in web table mining. The main obstacle in this task is the difficulty of generating features relevant to the discrimination. This paper proposes a novel method that discriminates between the two kinds of tables using a composite kernel, which combines a parse tree kernel and a linear kernel. Since an HTML parser represents a web table as a parse tree, the parse tree kernel is a natural way to measure the similarity between tables, and a linear kernel over content features compensates for the weak points of the parse tree kernel. Support vector machines with the composite kernel distinguish meaningful tables from decorative ones with high accuracy. A series of experiments shows that the proposed method achieves state-of-the-art performance.
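
The idea of a composite kernel can be illustrated with a minimal sketch. The abstract does not give the kernel definitions, so everything specific below is an assumption rather than the authors' formulation: trees are written as (label, children) pairs, the tree kernel is a subtree-counting kernel in the style of Collins and Duffy with a decay factor, the tree kernel is normalized to the range [0, 1], and the two kernels are mixed with a hypothetical weight alpha. The names tree_kernel, linear_kernel, and composite_kernel are illustrative only.

```python
from itertools import product
from math import sqrt

# Assumed representation: a tree node is (label, [child nodes]).


def nodes(tree):
    """Yield every node of the tree in pre-order."""
    yield tree
    for child in tree[1]:
        yield from nodes(child)


def production(node):
    """A node's label together with the ordered labels of its children."""
    return (node[0], tuple(child[0] for child in node[1]))


def common_subtrees(n1, n2, decay):
    """Decay-weighted count of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Productions match, so both nodes have the same number of children.
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + common_subtrees(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=0.5):
    """Sum the common-subtree scores over all pairs of nodes."""
    return sum(common_subtrees(a, b, decay)
               for a, b in product(nodes(t1), nodes(t2)))


def normalized_tree_kernel(t1, t2, decay=0.5):
    """Scale the tree kernel so that a tree compared with itself gives 1."""
    denom = sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / denom if denom else 0.0


def linear_kernel(x1, x2):
    """Plain dot product of two content-feature vectors."""
    return sum(a * b for a, b in zip(x1, x2))


def composite_kernel(s1, s2, alpha=0.5, decay=0.5):
    """Weighted sum of the tree kernel and the linear kernel.

    Each sample is a (tree, feature_vector) pair; alpha is an assumed
    mixing weight between 0 and 1.
    """
    (t1, x1), (t2, x2) = s1, s2
    return (alpha * normalized_tree_kernel(t1, t2, decay)
            + (1.0 - alpha) * linear_kernel(x1, x2))


# Two toy tables: one row with two cells, one row with a single cell.
two_cells = ("table", [("tr", [("td", []), ("td", [])])])
one_cell = ("table", [("tr", [("td", [])])])

# Hypothetical content features (for example, scaled text statistics).
sample_a = (two_cells, [1.0, 0.0])
sample_b = (one_cell, [0.5, 0.5])

similarity = composite_kernel(sample_a, sample_b)
```

A non-negative weighted sum of two valid kernels is itself a valid kernel, which is why a combination of this form can be used with a support vector machine. In practice, one way to use such a function is to compute the full Gram matrix over the training tables and pass it to an SVM implementation that accepts precomputed kernels.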