ODE: Ontology-assisted data extraction

Authors:
Weifeng Su; Jiying Wang; Frederick H. Lochovsky
Affiliations:
BNU-HKBU United International College and Shenzhen Key Laboratory of Intelligent Media and Speech, PKU-HKUST Shenzhen Hong Kong Institution, China; City University of Hong Kong, Kowloon, Hong Kong; The Hong Kong University of Science and Technology, Kowloon, Hong Kong
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2009

Abstract

Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different Web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.
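
The second phase described in the abstract (using a domain ontology to pick out the query result section and to label data values) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not ODE's actual algorithm: the `Attribute` class, the scoring weights (1.0 for a known value, 0.5 for a format match), the helper names `label_record` and `find_result_section`, and the toy book-domain data are all hypothetical. The ontology is assumed to have been built already, and candidate sections are assumed to be pre-segmented into records of text values.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One hypothetical ontology attribute: a label plus evidence for its values."""
    label: str
    known_values: set = field(default_factory=set)  # lowercase values seen on other sites
    pattern: str = ""                               # optional regex describing the value format

    def score(self, value: str) -> float:
        """Return 1.0 for a known value, 0.5 for a format match, otherwise 0.0."""
        text = value.strip().lower()
        if text in self.known_values:
            return 1.0
        if self.pattern and re.fullmatch(self.pattern, text):
            return 0.5
        return 0.0


def label_record(values, ontology):
    """Pair each data value with its best-scoring attribute label, or None."""
    labeled = []
    for value in values:
        best = max(ontology, key=lambda attr: attr.score(value))
        labeled.append((best.label if best.score(value) > 0 else None, value))
    return labeled


def find_result_section(sections, ontology):
    """Pick the candidate section whose values match the ontology most strongly."""
    def coverage(section):
        values = [v for record in section for v in record]
        if not values:
            return 0.0
        return sum(max(a.score(v) for a in ontology) for v in values) / len(values)
    return max(sections, key=coverage)


# Toy book-domain ontology (assumed, not taken from the paper).
ontology = [
    Attribute("author", {"stephen king", "j. k. rowling"}),
    Attribute("format", {"hardcover", "paperback"}),
    Attribute("price", pattern=r"\$\d+(\.\d{2})?"),
]

# Two candidate page sections, each a list of records (lists of text values).
nav = [["Home", "About", "Contact"]]
results = [
    ["Stephen King", "Paperback", "$7.99"],
    ["J. K. Rowling", "Hardcover", "$24.50"],
]

section = find_result_section([nav, results], ontology)
for record in section:
    print(label_record(record, ontology))
```

In this sketch the navigation section scores zero because none of its values match the ontology, so the section containing the book records is selected and each of its values receives an attribute label.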