Integrating and querying web databases and documents

  • Authors:
  • Carlos Garcia-Alvarado;Carlos Ordonez

  • Affiliations:
  • University of Houston, Houston, TX, USA;University of Houston, Houston, TX, USA

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Abstract

The Internet contains many interrelated information sources that can be categorized as structured (databases) or semistructured (documents). A key challenge is to integrate, query, and analyze such heterogeneous collections of information. In this paper, we defend the idea of building web metadata repositories that use relational databases as the main source of structured data and as the central data management technology, enriched by the semistructured data surrounding them. Our proposal rests on the assumption that heterogeneous relational databases can be integrated (i.e., entity resolution is assumed to work well) and can thus serve as references for external data. That is, we tackle the problem of integrating information in the deep web, starting from databases. We discuss a prototype system, based on relational database technology, that can integrate and query metadata and related documents. Metadata includes database ER model elements such as database name, table, and column (entity, attribute). Web document data include files, documents, and web pages. Links between metadata and external documents are built with SQL queries. Once databases and documents are linked, they are managed and queried with SQL. We discuss an interesting scientific application of our solution with a water pollution database.
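
The abstract states that links between metadata and documents are built with SQL and then queried with SQL, but gives no schema. The following is a minimal sketch of that idea under stated assumptions, not the authors' prototype: the table names (`metadata`, `document`, `link`), the sample rows, the URIs, and the substring-match linking rule are all hypothetical and chosen only for illustration. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3


def build_repository():
    """Create a hypothetical metadata repository with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- ER model elements: database name, table (entity), column (attribute)
        CREATE TABLE metadata (
            meta_id     INTEGER PRIMARY KEY,
            db_name     TEXT,
            table_name  TEXT,
            column_name TEXT
        );
        -- External files, documents, and web pages
        CREATE TABLE document (
            doc_id INTEGER PRIMARY KEY,
            uri    TEXT,
            body   TEXT
        );
        -- One row per (metadata element, document) association
        CREATE TABLE link (
            meta_id INTEGER REFERENCES metadata(meta_id),
            doc_id  INTEGER REFERENCES document(doc_id),
            PRIMARY KEY (meta_id, doc_id)
        );
    """)
    conn.executemany(
        "INSERT INTO metadata (db_name, table_name, column_name) VALUES (?, ?, ?)",
        [
            ("water_quality", "well", "arsenic"),
            ("water_quality", "well", "nitrate"),
            ("water_quality", "station", "location"),
        ],
    )
    conn.executemany(
        "INSERT INTO document (uri, body) VALUES (?, ?)",
        [
            ("http://example.org/report1.html", "Arsenic levels in wells exceeded limits."),
            ("file:///docs/notes.txt", "Nitrate and arsenic sampling schedule."),
            ("http://example.org/about.html", "Contact details for the lab."),
        ],
    )
    return conn


def link_documents(conn):
    """Build links with one SQL statement; return the number of links.

    Assumed rule: a document is linked to a column when its text
    contains the column name (case-insensitive substring match).
    """
    conn.execute("""
        INSERT INTO link (meta_id, doc_id)
        SELECT m.meta_id, d.doc_id
        FROM metadata AS m
        JOIN document AS d
          ON instr(lower(d.body), lower(m.column_name)) > 0
    """)
    return conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]


def documents_for_column(conn, column_name):
    """Query the linked repository: URIs of documents tied to a column."""
    rows = conn.execute(
        """
        SELECT d.uri
        FROM metadata AS m
        JOIN link     AS l ON l.meta_id = m.meta_id
        JOIN document AS d ON d.doc_id  = l.doc_id
        WHERE m.column_name = ?
        ORDER BY d.doc_id
        """,
        (column_name,),
    )
    return [uri for (uri,) in rows]


if __name__ == "__main__":
    conn = build_repository()
    print(link_documents(conn), "links built")
    print(documents_for_column(conn, "arsenic"))
```

Keeping the links in an ordinary table means that, once built, metadata and documents can be joined like any other relations. A real system would likely need a more robust matching rule than plain substring search.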