Automating the extraction of data from HTML tables with unknown structure

Authors:
David W. Embley;Cui Tao;Stephen W. Liddle
Affiliations:
Department of Computer Science, Brigham Young University, Provo, UT;Department of Computer Science, Brigham Young University, Provo, UT;Information Systems Group and Rollins eBusiness Center, Brigham Young University, Provo, UT
Venue:
Data & Knowledge Engineering - Special issue: ER 2002
Year:
2005

Abstract

Data on the Web in HTML tables is mostly structured, but we usually do not know the structure in advance. Thus, we cannot directly query for data of interest. We propose a solution to this problem based on document-independent extraction ontologies. Our solution entails elements of table understanding, data integration, and wrapper creation. Table understanding allows us to find tables of interest within a Web page, recognize attributes and values within the table, pair attributes with values, and form records. Data-integration techniques allow us to match source records with a target schema. Ontologically specified wrappers allow us to extract data from source records into a target schema. Experimental results show that we can successfully locate data of interest in tables and map the data from source HTML tables with unknown structure to a given target database schema. We can thus "directly" query source data with unknown structure through a known target schema.
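
The pipeline the abstract describes (recognize attributes and values, pair them, form records, map them to a target schema) can be illustrated with a minimal sketch. This is not the authors' system: the `TableParser` class, the `TARGET_SCHEMA` synonym table, and the sample car-ad table are all hypothetical, and the sketch assumes the simplest case, where the first row holds the attribute labels. A plain synonym lookup stands in for the extraction ontology, which in the paper is considerably richer.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row found on a page."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for an extraction ontology:
# target attribute -> source labels that may denote it.
TARGET_SCHEMA = {
    "make": {"make", "manufacturer"},
    "year": {"year", "yr"},
    "price": {"price", "cost", "asking price"},
}


def extract_records(html):
    """Pair header labels with cell values and map them to the target schema."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    # Assumption: the first row is the header; the rest are data rows.
    header, *body = parser.rows
    # Match each source column to a target attribute, skipping unknown columns.
    mapping = {}
    for index, label in enumerate(header):
        for target, synonyms in TARGET_SCHEMA.items():
            if label.lower() in synonyms:
                mapping[index] = target
    # Form one record per data row, keyed by target attribute names.
    return [
        {target: row[index] for index, target in mapping.items() if index < len(row)}
        for row in body
    ]


SAMPLE = """
<table>
  <tr><th>Manufacturer</th><th>Yr</th><th>Color</th><th>Cost</th></tr>
  <tr><td>Honda</td><td>1999</td><td>blue</td><td>$4,500</td></tr>
  <tr><td>Ford</td><td>2001</td><td>red</td><td>$6,200</td></tr>
</table>
"""

records = extract_records(SAMPLE)
```

Here the source labels `Manufacturer`, `Yr`, and `Cost` are matched to the target attributes `make`, `year`, and `price`, while `Color` is dropped because the hypothetical target schema has no counterpart for it. Real tables with nested headers, attributes in columns rather than rows, or values that need splitting or merging would require the table-understanding and data-integration techniques the abstract refers to.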