Information retrieval from digital libraries in SQL

  • Authors:
  • Carlos Garcia-Alvarado; Carlos Ordonez

  • Affiliations:
  • University of Houston, Houston, TX, USA; University of Houston, Houston, TX, USA

  • Venue:
  • Proceedings of the 10th ACM workshop on Web information and data management
  • Year:
  • 2008

Abstract

Information retrieval techniques have traditionally been applied outside relational database systems because of storage overhead, the complexity of programming them inside the database system, and the slow performance of SQL implementations. This project supports the idea that digital libraries can be searched and queried with information retrieval models inside a relational database system, using optimized SQL queries and User-Defined Functions. We propose several techniques divided into two phases: storing and retrieving. The storing phase covers document pre-processing, stop-word removal, and term extraction. The retrieval phase is implemented with three fundamental IR models: the popular Vector Space Model, the Okapi Probabilistic Model, and the Dirichlet Prior Language Model. We conduct experiments using article abstracts from the DBLP bibliography and the ACM Digital Library. We evaluate several query optimizations, compare the on-demand and static weighting approaches, and study performance on conjunctive and disjunctive queries with the three ranking models. Our prototype scaled linearly and performed satisfactorily on medium-sized document collections. Our implementation of the Vector Space Model is competitive with the other two models.
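
To make the two phases concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single hypothetical table `term_freq(doc_id, term, tf)`, a toy stop-word list, and a simplified TF-IDF score (a sum of tf times idf over matching query terms, with no length normalization) standing in for the Vector Space Model. Weights are computed at query time, loosely in the spirit of on-demand weighting, and the logarithm is registered as a user-defined function named `ln_udf` (a name chosen here for illustration) because SQLite builds do not always include math functions. The query is disjunctive: a document matches if it contains any query term.

```python
import math
import re
import sqlite3

# Toy stop-word list (an assumption for this sketch, not from the paper).
STOP_WORDS = {"the", "of", "in", "and", "a", "with", "for"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def build_index(conn, docs):
    """Storing phase: extract terms and store per-document term frequencies."""
    conn.execute(
        "CREATE TABLE term_freq ("
        "doc_id INTEGER, term TEXT, tf INTEGER, "
        "PRIMARY KEY (doc_id, term))"
    )
    for doc_id, text in docs.items():
        counts = {}
        for token in tokenize(text):
            counts[token] = counts.get(token, 0) + 1
        conn.executemany(
            "INSERT INTO term_freq VALUES (?, ?, ?)",
            [(doc_id, term, tf) for term, tf in counts.items()],
        )


def search(conn, query):
    """Retrieval phase: rank documents by a simplified TF-IDF score in SQL."""
    terms = tokenize(query)
    if not terms:
        return []
    # Hypothetical UDF name; natural log used for the idf weight.
    conn.create_function("ln_udf", 1, math.log)
    placeholders = ",".join("?" for _ in terms)
    sql = f"""
        WITH n AS (SELECT COUNT(DISTINCT doc_id) AS total FROM term_freq),
             df AS (SELECT term, COUNT(*) AS docs
                    FROM term_freq
                    WHERE term IN ({placeholders})
                    GROUP BY term)
        SELECT t.doc_id,
               SUM(t.tf * ln_udf(1.0 + CAST(n.total AS REAL) / df.docs)) AS score
        FROM term_freq AS t
        JOIN df ON df.term = t.term
        CROSS JOIN n
        GROUP BY t.doc_id
        ORDER BY score DESC, t.doc_id
    """
    return conn.execute(sql, terms).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {
        1: "SQL query optimization in relational database systems",
        2: "Information retrieval with the vector space model",
        3: "Ranking documents for information retrieval in SQL",
    })
    for doc_id, score in search(conn, "information retrieval SQL"):
        print(doc_id, round(score, 3))
```

A conjunctive variant of the same query could add `HAVING COUNT(*) = <number of distinct query terms>` so that only documents containing every term are returned. A static weighting variant would precompute the idf-weighted values into a table at storing time instead of inside the query.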