Detecting Tables in HTML Documents

Authors:
Yalin Wang;Jianying Hu
Venue:
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Year:
2002

Abstract

Tables are a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to any domain. Various features reflecting the layout as well as content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different web sites from various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
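
The abstract refers to leaf <table> elements and to an F-measure without defining either. The Python sketch below illustrates both under stated assumptions: a leaf table is taken to be a <table> element that contains no nested <table>, and the F-measure is taken to be the usual balanced harmonic mean of precision and recall. The names LeafTableCounter, count_leaf_tables, and f_measure are hypothetical and are not taken from the paper; the paper's feature extraction and classifier are not reproduced here.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Count <table> elements that contain no nested <table>.

    Assumption: this is what "leaf <table> element" means in the abstract.
    """

    def __init__(self):
        super().__init__()
        # One flag per currently open <table>: has it seen a nested table?
        self._stack = []
        self.leaf_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                self._stack[-1] = True  # the enclosing table is not a leaf
            self._stack.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            has_nested = self._stack.pop()
            if not has_nested:
                self.leaf_count += 1


def count_leaf_tables(html):
    """Return the number of leaf <table> elements in an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.leaf_count


def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (assumed form): 2PR / (P + R)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

In this sketch a layout table that wraps two inner tables contributes two leaves rather than three, since the outer element is excluded once a nested table is seen. Deciding whether each leaf is a genuine relational table is the classification step the paper addresses and is outside the scope of this example.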