Data integration for the relational web

Authors:
Michael J. Cafarella;Alon Halevy;Nodira Khoussainova
Affiliations:
University of Washington, Seattle, WA;Google, Inc., Mountain View, CA;University of Washington, Seattle, WA
Venue:
Proceedings of the VLDB Endowment
Year:
2009

References: 13
Cited by: 30

Navigational plans for data integration

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Reconciling schemas of disparate data sources: a machine-learning approach

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Declarative Data Cleaning: Language, Model, and Algorithms

Proceedings of the 27th International Conference on Very Large Data Bases
Potter's Wheel: An Interactive Data Cleaning System

Proceedings of the 27th International Conference on Very Large Data Bases
A survey of approaches to automatic schema matching

The VLDB Journal — The International Journal on Very Large Data Bases
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Building data integration queries by demonstration

Proceedings of the 12th international conference on Intelligent user interfaces
Making mashups with marmite: towards end-user programming for the web

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Building structured web community portals: a top-down, compositional, and incremental approach

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
WebTables: exploring the power of tables on the web

Proceedings of the VLDB Endowment
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment

Querying the deep web

Proceedings of the 13th International Conference on Extending Database Technology
Automatic extraction of clickable structured web contents for name entity queries

Proceedings of the 19th international conference on World wide web
Entity relation discovery from web tables and links

Proceedings of the 19th international conference on World wide web
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Redundancy-driven web data extraction and integration

Proceedings of the 13th International Workshop on the Web and Databases
Structured data on the web

Communications of the ACM
Materializing multi-relational databases from the web using taxonomic queries

Proceedings of the fourth ACM international conference on Web search and data mining
Harvesting relational tables from lists on the web

The VLDB Journal — The International Journal on Very Large Data Bases
Automatic example queries for ad hoc databases

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
OSD-DB: a military logistics mobile database

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Recovering semantics of tables on the web

Proceedings of the VLDB Endowment
Database-as-a-service for long-tail science

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
A secured collaborative model for data integration in life sciences

Transactions on large-scale data- and knowledge-centered systems IV
A novel measure of edge centrality in social networks

Knowledge-Based Systems
Chapter 7: dataspaces

Search Computing
Sample-driven schema mapping

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Finding related tables

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Using Bayesian networks theory for aggregated search to XML retrieval

Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Answering table queries on the web using column keywords

Proceedings of the VLDB Endowment
Real-time population of knowledge bases: opportunities and challenges

AKBC-WEKEX '12 Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction
Mix-n-Match: building personal libraries from web content

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Robust web data extraction: a novel approach based on minimum cost script edit model

WISM'12 Proceedings of the 2012 international conference on Web Information Systems and Mining
InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The parallel path framework for entity discovery on the web

ACM Transactions on the Web (TWEB)
Aggregated search: A new information retrieval paradigm

ACM Computing Surveys (CSUR)
Extraction and integration of partially overlapping web sources

Proceedings of the VLDB Endowment
Schema extraction for tabular data on the web

Proceedings of the VLDB Endowment
Towards the integration of images on the Web

Proceedings of International Conference on Information Integration and Web-based Applications & Services
Synthesizing union tables from the web

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence

Abstract

The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tools. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, due to the vastness of the corpus, a user can never know all of the potentially relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information about the data is present only in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself; and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
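To make the idea of a best-effort Extend-style operator concrete, the following is a minimal sketch and not the algorithm used by Octopus, which the abstract does not specify. It assumes the related source table has already been found, and it joins on a loosely normalized key. The names `normalize` and `extend`, the table representation (a list of dictionaries), and the sample rows are all hypothetical, chosen only for illustration.

```python
from typing import Optional

# Assumed representation: a table is a list of rows, each row a dict.
Row = dict[str, Optional[str]]
Table = list[Row]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so near-identical keys compare equal."""
    return " ".join(value.lower().split())


def extend(table: Table, join_col: str, source: Table,
           source_key: str, new_attr: str) -> Table:
    """Return a copy of `table` with `new_attr` filled in from `source`.

    Rows with no match get None: the result is best-effort, and the
    user is expected to review and correct it.
    """
    # Index the source table by its normalized join key.
    lookup = {
        normalize(row[source_key]): row.get(new_attr)
        for row in source
        if row.get(source_key) is not None
    }
    extended = []
    for row in table:
        new_row = dict(row)  # copy, so the input table is left unchanged
        new_row[new_attr] = lookup.get(normalize(row[join_col]))
        extended.append(new_row)
    return extended


# Illustrative data only (hypothetical user table and web source).
cities = [{"city": "Seattle"}, {"city": "Mountain  View"}, {"city": "Lyon"}]
web_table = [
    {"name": "seattle", "state": "WA"},
    {"name": "Mountain View", "state": "CA"},
]
result = extend(cities, "city", web_table, "name", "state")
for row in result:
    print(row)
```

Unmatched rows are kept with an empty value rather than dropped, which mirrors the abstract's point that the user can always give feedback and correct errors. A real system would also need approximate matching and source discovery, both omitted here.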