Redundancy-driven web data extraction and integration

Authors:
Lorenzo Blanco;Mirko Bronzi;Valter Crescenzi;Paolo Merialdo;Paolo Papotti
Affiliations:
Università degli Studi Roma Tre;Università degli Studi Roma Tre;Università degli Studi Roma Tre;Università degli Studi Roma Tre;Università degli Studi Roma Tre
Venue:
Proceedings of the 13th International Workshop on the Web and Databases
Year:
2010

Citing (12):

RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases

Extracting Patterns and Relations from the World Wide Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases

Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Schema Matching Using Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering

Context-aware wrapping: synchronized data extraction
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Bootstrapping pay-as-you-go data integration systems
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Toward best-effort information extraction
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

WebTables: exploring the power of tables on the web
Proceedings of the VLDB Endowment

Supporting the automatic construction of entity aware search engines
Proceedings of the 10th ACM workshop on Web information and data management

Open information extraction from the web
IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence

Integrating conflicting data: the role of source dependence
Proceedings of the VLDB Endowment

Data integration for the relational web
Proceedings of the VLDB Endowment

Cited by (3):

Automatically building probabilistic databases from the web
Proceedings of the 20th international conference companion on World wide web

Characterizing the uncertainty of web data: models and experiences
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality

Web data reconciliation: models and experiences
Search Computing

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but current applications use these data only partially. Although such information is spread across a myriad of sources, the scale of the Web implies significant redundancy. We present a domain-independent system that exploits this redundancy to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.