Table extraction using spatial reasoning on the CSS2 visual box model

Authors:
Wolfgang Gatterbauer;Paul Bohunsky
Affiliations:
Database and Artificial Intelligence Group, Vienna University of Technology, Austria;Database and Artificial Intelligence Group, Vienna University of Technology, Austria
Venue:
AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2
Year:
2006

Abstract

Tables on web pages contain a huge amount of semantically explicit information, which makes them a worthwhile target for automatic information extraction and knowledge acquisition from the Web. However, extracting tables from web pages is difficult because HTML was designed to convey visual rather than semantic information. In this paper, we propose a robust technique for table extraction from arbitrary web pages. The technique relies on the positional information of DOM element nodes as rendered in a browser and thereby separates the intricacies of code implementation from the intended visual appearance. The novel aspect of the proposed technique is its effective use of spatial reasoning on the CSS2 visual box model, which shows a high level of robustness even without any form of learning (F-measure ≈ 90%). We describe the ideas behind our approach; the tabular pattern recognition algorithm, which operates on a double topographical grid structure and allows for effective and robust extraction; and general observations on web tables that should be borne in mind by any automatic web table extraction mechanism.
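
To make the idea of reasoning over rendered positions rather than markup more concrete, the following is a minimal sketch in Python. It is not the paper's algorithm and does not implement the double topographical grid; it only illustrates, under simplifying assumptions, how bounding boxes alone can reveal a tabular arrangement. The `Box` type, the functions `cluster_edges`, `index_of`, and `to_grid`, and the pixel tolerance `tol` are all hypothetical names introduced for this illustration, and the boxes are assumed to have been obtained from a browser beforehand.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of one element in pixels (hypothetical input)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def cluster_edges(values: List[float], tol: float) -> List[float]:
    """Return one representative per group of nearly equal coordinates.

    A coordinate joins the current group when it lies within `tol` of that
    group's representative (its smallest member); otherwise it opens a new one.
    """
    reps: List[float] = []
    for v in sorted(values):
        if not reps or v - reps[-1] > tol:
            reps.append(v)
    return reps


def index_of(value: float, reps: List[float], tol: float) -> int:
    """Find the index of the group whose representative is within `tol`."""
    for i, rep in enumerate(reps):
        if abs(value - rep) <= tol:
            return i
    raise ValueError(f"no group found for coordinate {value}")


def to_grid(boxes: List[Box], tol: float = 2.0) -> List[List[str]]:
    """Place each box into a row/column grid using only its top-left corner.

    Assumption: cells of one column share a left edge and cells of one row
    share a top edge, up to `tol` pixels. Empty grid positions stay "".
    """
    cols = cluster_edges([b.left for b in boxes], tol)
    rows = cluster_edges([b.top for b in boxes], tol)
    grid = [["" for _ in cols] for _ in rows]
    for b in boxes:
        r = index_of(b.top, rows, tol)
        c = index_of(b.left, cols, tol)
        grid[r][c] = b.text
    return grid
```

Calling `to_grid` on four boxes laid out as two aligned rows and two aligned columns yields a 2x2 grid regardless of whether the page used `<table>`, `<div>`, or any other markup to produce that layout. A real system would also need to handle spanning cells, nested boxes, and misaligned content, which this sketch deliberately ignores.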