Context-aware wrapping: synchronized data extraction

Authors:
Shui-Lung Chuang;Kevin Chen-Chuan Chang;ChengXiang Zhai
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 18
Cited 18

Information extraction

Communications of the ACM
A view of the EM algorithm that justifies incremental, sparse, and other variants

Proceedings of the NATO Advanced Study Institute on Learning in graphical models
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
A brief survey of web data extraction tools

ACM SIGMOD Record
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Similarity Flooding: A Versatile Graph Matching Algorithm and Its Application to Schema Matching

ICDE '02 Proceedings of the 18th International Conference on Data Engineering
An interactive clustering-based approach to integrating source query interfaces on the deep Web

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Fully automatic wrapper generation for search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
2D Conditional Random Fields for Web information extraction

ICML '05 Proceedings of the 22nd international conference on Machine learning
Integrating Unstructured Data into Relational Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Wrapper maintenance: a machine learning approach

Journal of Artificial Intelligence Research
Semistructured data: the TSIMMIS experience

ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems

An unsupervised framework for extracting and normalizing product attributes from multiple web sites

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Integrating web query results: holistic schema matching

Proceedings of the 17th ACM conference on Information and knowledge management
Automatic wrapper induction from hidden-web sources with domain knowledge

Proceedings of the 10th ACM workshop on Web information and data management
An unsupervised method for joint information extraction and feature mining across different Web sites

Data & Knowledge Engineering
Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

World Wide Web
Dynamic personalization for meta-queriers

IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration
Cost-effective web search in bootstrapping for named entity recognition

DASFAA'08 Proceedings of the 13th international conference on Database systems for advanced applications
ONDUX: on-demand unsupervised learning for information extraction

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Unsupervised strategies for information extraction by text segmentation

Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
Redundancy-driven web data extraction and integration

Procceedings of the 13th International Workshop on the Web and Databases
ObjectRunner: lightweight, targeted extraction and querying of structured web data

Proceedings of the VLDB Endowment
Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
On-line web database integration

Proceedings of the International Conference on Management of Emergent Digital EcoSystems
Wrapper Generation for Overlapping Web Sources

WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Semi-supervised multi-task learning of structured prediction models for web information extraction

Proceedings of the 20th ACM international conference on Information and knowledge management
Deep web integrated systems: current achievements and open issues

Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
Discovering interesting information with advances in web technology

ACM SIGKDD Explorations Newsletter
Extraction and integration of partially overlapping web sources

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of "context awareness": Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17--83% to 78--94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09--60% to 01--11%).