Exploiting information redundancy to wring out structured data from the web

Authors:
Lorenzo Blanco;Mirko Bronzi;Valter Crescenzi;Paolo Merialdo;Paolo Papotti
Affiliations:
Università degli Studi Roma Tre, Roma, Italy;Università degli Studi Roma Tre, Roma, Italy;Università degli Studi Roma Tre, Roma, Italy;Università degli Studi Roma Tre, Roma, Italy;Università degli Studi Roma Tre, Roma, Italy
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

A large number of web sites publish pages containing structured information about recognizable concepts, but current applications use these data only partially. Although such information is spread across a myriad of sources, the scale of the web implies considerable redundancy. We present a domain-independent system that exploits this redundancy to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and data integration tasks. Experiments confirmed the quality and feasibility of the approach.