Automatic hidden-web table interpretation by sibling page comparison

Authors:
Cui Tao;David W. Embley
Affiliations:
Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah
Venue:
ER'07 Proceedings of the 26th international conference on Conceptual modeling
Year:
2007

Abstract

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
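
The core idea of comparing sibling pages to separate nonvarying components (category labels) from varying components (data values) can be illustrated with a minimal sketch. This is not the authors' system: it assumes each table has already been reduced to a grid of cell strings and compares cells by position only. The names `classify_cells` and `interpret`, and the car-advertisement data, are made up for illustration. A value that happens to be identical on every compared page would be misclassified as a label, which is one reason a real system would need to adjust its patterns as more pages arrive.

```python
from typing import Dict, List, Tuple

Table = List[List[str]]  # a table as rows of cell text (assumed pre-extracted)


def classify_cells(siblings: List[Table]) -> Dict[Tuple[int, int], str]:
    """Mark each (row, col) of the first table as 'label' if its text is
    identical at that position in every sibling table, else 'value'."""
    if len(siblings) < 2:
        raise ValueError("need at least two sibling tables to compare")
    first = siblings[0]
    pattern: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(first):
        for c, text in enumerate(row):
            same_everywhere = all(
                r < len(t) and c < len(t[r]) and t[r][c] == text
                for t in siblings[1:]
            )
            pattern[(r, c)] = "label" if same_everywhere else "value"
    return pattern


def interpret(table: Table, pattern: Dict[Tuple[int, int], str]) -> Dict[str, str]:
    """Pair each label cell with the value cell immediately to its right.
    (A simplifying assumption: labels sit left of their values.)"""
    record: Dict[str, str] = {}
    for (r, c), kind in pattern.items():
        if kind == "label" and pattern.get((r, c + 1)) == "value":
            if r < len(table) and c + 1 < len(table[r]):
                record[table[r][c]] = table[r][c + 1]
    return record


# Hypothetical sibling pages from a car-advertisement site.
page_a = [["Make", "Honda"], ["Year", "2004"], ["Price", "$8,500"]]
page_b = [["Make", "Toyota"], ["Year", "2001"], ["Price", "$6,200"]]

pattern = classify_cells([page_a, page_b])
record = interpret(page_b, pattern)
```

Here `pattern` plays the role of a very crude structure pattern: once derived from two siblings, it can be applied to further pages generated from the same template.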