We present an unsupervised approach for harvesting the data exposed by a set of structured, partially overlapping, data-intensive web sources. Our proposal is set within a formal framework that tackles two problems: data extraction, which generates extraction rules from the input websites, and data integration, which integrates the extracted data into a unified schema. We introduce an original algorithm, WEIR, to solve both problems and formally prove its correctness. WEIR leverages the data that overlap among sources to make better decisions both in data extraction (by pruning rules that do not lead to redundant information) and in data integration (by reflecting local properties of a source over the mediated schema). Along the way, we characterize the amount of redundancy our algorithm needs to produce a solution, and we present experimental results showing the benefits of our approach over existing solutions.
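The idea of pruning extraction rules that do not lead to redundant information can be illustrated with a small sketch. This is not the WEIR algorithm itself, only a toy instance of redundancy-based pruning under several assumptions that do not come from the abstract: each candidate rule is represented by the values it extracts, objects are already aligned across sources by a shared key, values are compared by exact match, and the names `overlap`, `prune_rules`, and `threshold` are hypothetical.

```python
from itertools import combinations


def overlap(values_a, values_b):
    """Fraction of shared objects on which two rules extract the same value.

    Each argument maps an object key to the value a rule extracted for it.
    Exact string equality is an assumption made for this sketch.
    """
    shared = values_a.keys() & values_b.keys()
    if not shared:
        return 0.0
    agree = sum(1 for key in shared if values_a[key] == values_b[key])
    return agree / len(shared)


def prune_rules(sources, threshold=0.5):
    """Keep a rule only if some rule from a different source agrees with it.

    `sources` maps a source name to {rule name: {object key: value}}.
    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    kept = {name: set() for name in sources}
    for (src_a, rules_a), (src_b, rules_b) in combinations(sources.items(), 2):
        for rule_a, vals_a in rules_a.items():
            for rule_b, vals_b in rules_b.items():
                if overlap(vals_a, vals_b) >= threshold:
                    kept[src_a].add(rule_a)
                    kept[src_b].add(rule_b)
    return kept


# Illustrative, made-up data: two sites describing the same two objects.
sources = {
    "site_a": {
        "price": {"obj1": "150", "obj2": "300"},
        "footer": {"obj1": "Copyright", "obj2": "Copyright"},
    },
    "site_b": {
        "last": {"obj1": "150", "obj2": "300"},
        "ad": {"obj1": "Buy now", "obj2": "Buy now"},
    },
}
kept = prune_rules(sources)
```

In this toy input, the `price` and `last` rules extract matching values for the shared objects and survive, while `footer` and `ad` match nothing in the other source and are pruned. A rule that extracts data present in only one source would also be discarded here, which is why the amount of redundancy available across sources matters.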