The Web contains a vast amount of structured information, such as HTML tables, HTML lists, and deep-web databases, and there is enormous potential in combining and repurposing this data in creative ways. However, integrating data from this relational web raises several challenges that current data integration systems and mash-up tools do not address. First, the structured data is usually not published cleanly and must be extracted (say, from an HTML list) before it can be used. Second, because the corpus is so vast, a user can never know all of the potentially relevant databases ahead of time (much less write a wrapper or mapping for each one); the source databases must be discovered during the integration process. Third, some of the important information about the data is present only in its enclosing web page and needs to be extracted appropriately. This paper describes Octopus, a system that combines search, extraction, data cleaning, and integration, and enables users to create new data sets from those found on the Web. The key idea underlying Octopus is to offer the user a set of best-effort operators that automate the most labor-intensive tasks. For example, the Search operator takes a search-style keyword query and returns a set of relevance-ranked and similarity-clustered structured data sources on the Web; the Context operator helps the user specify the semantics of the sources by inferring attribute values that may not appear in the source itself; and the Extend operator helps the user find related sources that can be joined to add new attributes to a table. Octopus executes some of these operators automatically, but always allows the user to provide feedback and correct errors. We describe the algorithms underlying each of these operators and experiments that demonstrate their efficacy.
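To make the three operators concrete, the following is a minimal sketch over a tiny in-memory corpus. It is not the Octopus implementation: the `Table` class, the function signatures, and the sample data are all hypothetical, and each function substitutes a deliberately naive technique (keyword counting for Search without any clustering, substring lookup in the page text for Context, and a case-insensitive key join against a table the caller supplies for Extend, whereas the abstract says Extend also helps find the related source).

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for a structured source extracted from a web page."""
    columns: list
    rows: list
    page_text: str = ""  # text of the enclosing page (assumed to be available)


def search(corpus, query):
    """Naive Search: rank tables by the number of query keywords they contain."""
    terms = query.lower().split()

    def score(t):
        cells = [c for row in t.rows for c in row]
        blob = " ".join(t.columns + cells + [t.page_text]).lower()
        return sum(term in blob for term in terms)

    scored = [(score(t), i, t) for i, t in enumerate(corpus)]
    ranked = sorted(scored, key=lambda x: (-x[0], x[1]))
    return [t for s, _, t in ranked if s > 0]


def context(table, name, candidates):
    """Naive Context: add a column whose value is the first candidate
    string that occurs in the enclosing page text (empty if none does)."""
    text = table.page_text.lower()
    value = next((c for c in candidates if c.lower() in text), "")
    return Table(table.columns + [name],
                 [row + [value] for row in table.rows],
                 table.page_text)


def extend(table, join_col, other, other_key, new_col):
    """Naive Extend: left-join `new_col` from `other` on a normalized key."""
    k, v = other.columns.index(other_key), other.columns.index(new_col)
    lookup = {row[k].strip().lower(): row[v] for row in other.rows}
    j = table.columns.index(join_col)
    return Table(table.columns + [new_col],
                 [row + [lookup.get(row[j].strip().lower(), "")]
                  for row in table.rows],
                 table.page_text)


# Made-up sample data, for illustration only.
corpus = [
    Table(["name", "institution"],
          [["A. Author", "Univ X"], ["B. Writer", "Univ Y"]],
          "Program committee of ExampleConf 2008"),
    Table(["institution", "country"],
          [["univ x", "Freedonia"], ["Univ Y", "Sylvania"]],
          "List of universities"),
]

hits = search(corpus, "program committee")
with_year = context(hits[0], "year", ["2007", "2008", "2009"])
result = extend(with_year, "institution", corpus[1], "institution", "country")
```

In this sketch every operator returns a new `Table` rather than mutating its input, so a user could inspect an intermediate result and correct it before applying the next operator, which is in the spirit of the feedback loop the abstract describes.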