Detecting tables in Web documents

Authors:
Yeon-Seok Kim;Kyong-Ho Lee
Affiliations:
Department of Computer Science, Yonsei University, 134 Shinchon-dong, Seodaemun-ku, Seoul 120-749, Republic of Korea;Department of Computer Science, Yonsei University, 134 Shinchon-dong, Seodaemun-ku, Seoul 120-749, Republic of Korea
Venue:
Engineering Applications of Artificial Intelligence
Year:
2005

Abstract

TABLE tags in HTML (Hypertext Markup Language) documents are widely used both for laying out Web pages and for describing genuine tables that carry relational information. As a prerequisite for information extraction from the Web, this paper presents an efficient method for detecting genuine tables. The proposed method consists of two phases: preprocessing and attribute-value relation extraction. During preprocessing, some genuine and non-genuine tables are filtered out using a set of rules devised from a careful examination of the general characteristics of various HTML tables. The remaining tables are classified in the attribute-value relation extraction phase. Specifically, a value area is extracted and checked for syntactic coherency. Furthermore, the method looks for semantic coherency between the attribute area and the value area of a table. Experimental results with 11,477 TABLE tags from 1,393 HTML documents show that the method performs better than previous work, achieving a precision of 97.54% and a recall of 99.22%.
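The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the filtering rules or define the coherency measure, so everything here is an assumption made for illustration. That includes the function names (`cell_type`, `preprocess`, `column_coherency`, `is_genuine`), the size-based filter rule, the treatment of the first row as the attribute area, and the 0.8 threshold. The semantic coherency check is omitted. Tables are assumed to be already parsed into lists of rows of cell strings.

```python
import re
from collections import Counter

# Hypothetical pattern for numeric cells such as "12", "1,200", "3.5", "40%".
NUMBER = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?%?$")


def cell_type(text):
    """Classify a cell as 'empty', 'number', or 'text' (illustrative types only)."""
    text = text.strip()
    if not text:
        return "empty"
    if NUMBER.match(text):
        return "number"
    return "text"


def preprocess(rows):
    """Phase 1 (assumed rule): reject tables too small to hold relations.

    Returns False for an obvious non-genuine table, None when undecided.
    """
    if len(rows) < 2 or min(len(r) for r in rows) < 2:
        return False
    return None


def column_coherency(rows):
    """Phase 2 (assumed measure): mean share of the dominant cell type per column.

    The first row is treated as the attribute area, the rest as the value area.
    """
    value_rows = rows[1:]
    n_cols = min(len(r) for r in value_rows)
    scores = []
    for c in range(n_cols):
        types = Counter(cell_type(r[c]) for r in value_rows)
        scores.append(types.most_common(1)[0][1] / len(value_rows))
    return sum(scores) / len(scores)


def is_genuine(rows, threshold=0.8):
    """Combine both phases; the threshold is an arbitrary illustrative value."""
    decided = preprocess(rows)
    if decided is not None:
        return decided
    return column_coherency(rows) >= threshold


prices = [["Item", "Price"], ["Apple", "1.20"], ["Pear", "0.95"]]
layout = [["Home | About | Contact"]]
mixed = [["A", "B"], ["x", "1"], ["2", "y"], ["z", ""]]
print(is_genuine(prices), is_genuine(layout), is_genuine(mixed))
```

In this sketch, a table whose value columns each hold one kind of content scores high, while a layout grid with mixed cell contents scores low. A real detector would need richer cell types and the semantic check between attribute and value areas that the paper describes.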