RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Open information extraction from the web
Communications of the ACM - Surviving the data deluge
WebTables: exploring the power of tables on the web
Proceedings of the VLDB Endowment
Supporting the automatic construction of entity aware search engines
Proceedings of the 10th ACM workshop on Web information and data management
Characterizing the uncertainty of web data: models and experiences
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Web data reconciliation: models and experiences
Search Computing
Hi-index | 0.00 |
A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the web scale implies a relevant redundancy. We present a domain independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g. financial data, product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments confirmed the quality and the feasibility of the approach.