Extraction and integration of partially overlapping web sources

Authors:
Mirko Bronzi;Valter Crescenzi;Paolo Merialdo;Paolo Papotti
Affiliations:
Università degli Studi Roma Tre, Rome, Italy;Università degli Studi Roma Tre, Rome, Italy;Università degli Studi Roma Tre, Rome, Italy;Qatar Computing Research Institute, Doha, Qatar
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

We present an unsupervised approach for harvesting the data exposed by a set of structured and partially overlapping data-intensive web sources. Our proposal is set within a formal framework that tackles two problems: the data extraction problem, which is to generate extraction rules from the input websites, and the data integration problem, which is to integrate the extracted data into a unified schema. We introduce an original algorithm, WEIR, to solve the stated problems and formally prove its correctness. WEIR leverages the data that overlap among sources to make better decisions both in data extraction (by pruning rules that do not lead to redundant information) and in data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy our algorithm needs to produce a solution, and present experimental results showing the benefits of our approach with respect to existing solutions.
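
The idea of exploiting overlap can be illustrated with a small Python sketch. This is not the WEIR algorithm, whose details the abstract does not give; it is a simplified greedy grouping written under several assumptions. Each candidate extraction rule is assumed to be represented by the values it extracts, keyed by a shared instance identifier. The names `distance`, `group_redundant_rules`, the `threshold` parameter, and the sample sites and rules are all hypothetical. Rules whose values agree with a rule from another source are grouped into one mediated attribute, rules that match nothing elsewhere are pruned, and two rules from the same source are never placed in the same group (one possible reading of "local properties of a source").

```python
from itertools import combinations


def distance(a, b):
    """Fraction of shared instances on which two rules disagree.

    Returns 1.0 when the rules have no instance in common (assumption:
    without overlap there is no evidence that they match).
    """
    shared = a.keys() & b.keys()
    if not shared:
        return 1.0
    return sum(a[k] != b[k] for k in shared) / len(shared)


def group_redundant_rules(rules, threshold=0.2):
    """Group rules from different sources that extract matching values.

    rules maps (source, rule_name) to {instance_id: extracted_value}.
    Returns a list of groups; rules left ungrouped are pruned.
    """
    group_of = {r: {r} for r in rules}
    # Candidate matches across sources, closest pairs first.
    pairs = sorted(
        (distance(rules[r1], rules[r2]), r1, r2)
        for r1, r2 in combinations(sorted(rules), 2)
        if r1[0] != r2[0]
    )
    for d, r1, r2 in pairs:
        if d > threshold:
            break
        g1, g2 = group_of[r1], group_of[r2]
        if g1 is g2:
            continue
        # Local constraint (assumed): one attribute per source in a group.
        if {s for s, _ in g1} & {s for s, _ in g2}:
            continue
        merged = g1 | g2
        for r in merged:
            group_of[r] = merged
    seen, result = set(), []
    for r in sorted(rules):
        g = group_of[r]
        if len(g) > 1 and id(g) not in seen:
            seen.add(id(g))
            result.append(sorted(g))
    return result


# Hypothetical candidate rules over three partially overlapping sources.
rules = {
    ("siteA", "r_price"): {"AAPL": "101", "MSFT": "55", "IBM": "140"},
    ("siteA", "r_ad"): {"AAPL": "Buy now", "MSFT": "Buy now", "IBM": "Buy now"},
    ("siteB", "r_last"): {"AAPL": "101", "MSFT": "55", "GOOG": "700"},
    ("siteB", "r_volume"): {"AAPL": "9M", "MSFT": "4M", "GOOG": "2M"},
    ("siteC", "r_vol"): {"MSFT": "4M", "GOOG": "2M"},
}

groups = group_redundant_rules(rules)
print(groups)
```

With this sample data, the rule `r_ad` extracts values that no other source confirms, so it is left out of every group, while the price rules and the volume rules end up in two separate groups. A real system would need a tolerant value comparison and a principled threshold; exact string equality is used here only to keep the sketch short.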