Automatic hidden-web table interpretation, conceptualization, and semantic annotation

Authors:
Cui Tao;David W. Embley
Affiliations:
Department of Computer Science, Brigham Young University, Provo, UT 84602, USA;Department of Computer Science, Brigham Young University, Provo, UT 84602, USA
Venue:
Data & Knowledge Engineering
Year:
2009

Abstract

The longstanding problem of automatic table interpretation still eludes us. Its solution would aid not only table-processing applications such as large-volume table conversion, but also related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system achieved an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we show that this interpretation leads immediately to a conceptualization of the data in the interpreted tables and thus also to a way to semantically annotate them with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queryable data. Thus, the entire process, which is fully automatic, transforms facts embedded within tables into facts accessible by standard query engines.
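
The core idea of sibling comparison, separating nonvarying cells (category labels) from varying cells (data values), can be illustrated with a minimal sketch. This is not the system described in the paper: it assumes two sibling tables have already been extracted into equally shaped grids of strings, and the function names `classify_cells` and `pair_labels_with_values`, the "nearest label to the left" pairing rule, and the sample car-advertisement data are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

Table = List[List[str]]


def classify_cells(sibling_a: Table, sibling_b: Table) -> List[List[str]]:
    """Mark each cell position as 'label' if both siblings hold the same
    text there (nonvarying), otherwise as 'value' (varying).

    Assumption: both tables have exactly the same shape.
    """
    same_shape = len(sibling_a) == len(sibling_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(sibling_a, sibling_b)
    )
    if not same_shape:
        raise ValueError("sibling tables must have the same shape")
    return [
        ["label" if cell_a == cell_b else "value"
         for cell_a, cell_b in zip(row_a, row_b)]
        for row_a, row_b in zip(sibling_a, sibling_b)
    ]


def pair_labels_with_values(
    table: Table, pattern: List[List[str]]
) -> List[Tuple[Optional[str], str]]:
    """Connect each value cell to the nearest label cell to its left in the
    same row (a simplifying assumption; real tables can nest labels)."""
    pairs: List[Tuple[Optional[str], str]] = []
    for row, kinds in zip(table, pattern):
        current_label: Optional[str] = None
        for cell, kind in zip(row, kinds):
            if kind == "label":
                current_label = cell
            else:
                pairs.append((current_label, cell))
    return pairs


if __name__ == "__main__":
    # Hypothetical sibling pages from a car-advertisement site.
    page_a = [["Make", "Honda", "Year", "2004"],
              ["Price", "$8,500", "Color", "Blue"]]
    page_b = [["Make", "Toyota", "Year", "2001"],
              ["Price", "$6,200", "Color", "Red"]]

    pattern = classify_cells(page_a, page_b)
    print(pair_labels_with_values(page_a, pattern))
    # [('Make', 'Honda'), ('Year', '2004'), ('Price', '$8,500'), ('Color', 'Blue')]
```

A two-page comparison like this misclassifies a data value as a label whenever both siblings happen to share it (for example, two cars of the same color), which is one reason a pattern would need to be adjusted as more sibling pages are processed.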