Efficient similarity-based declustering techniques for keyword-based information retrieval in the streaming data model

  • Authors:
  • Sanjiv Behl; Rakesh M. Verma

  • Affiliations:
  • University of Houston-Victoria, Victoria, TX; University of Houston, Houston, TX

  • Venue:
  • PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
  • Year:
  • 2007

Abstract

Multiple-disk architectures are an attractive approach to meeting high-performance I/O demands in I/O-intensive applications such as search engines, web servers, and information retrieval systems. This requires that the issues of dynamic load balancing and access parallelism be addressed, which is the goal of this paper. We address the problem of document declustering in a keyword-based information retrieval system for parallel architectures consisting of a single processor and multiple disks. We propose and experimentally evaluate four similarity-based methods for declustering documents: set, multiset, vector, and Euclidean. Interestingly, our results show that for single-keyword queries as well as Boolean AND queries, the set and multiset methods generally outperform the vector and Euclidean methods, with set being the best for the so-called simple plan. We also introduce a highest-frequency-first retrieval scenario and compare the methods under it, finding that the set and multiset methods are still generally superior to the other methods, with multiset outperforming set. We compare these methods with the (theoretically) optimal values, which are practically impossible to achieve. Finally, we approximate the multiset method using the harmonic mean and find that the results are slightly inferior to those of the multiset method, but still better than those of the vector and Euclidean methods.
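
The abstract names four similarity measures but does not define them, so the following Python sketch is only one plausible reading, not the authors' formulation. It assumes that "set" means Jaccard overlap of distinct keywords, "multiset" means the same ratio with keyword counts, "vector" means cosine similarity of term-frequency vectors, and "Euclidean" means the distance between those vectors. The greedy `decluster` routine and all function names are hypothetical; they only illustrate the general idea of placing similar documents on different disks so that a query touching them can be served in parallel.

```python
import math
from collections import Counter


def set_similarity(a, b):
    """Jaccard overlap of distinct keywords (assumed reading of 'set')."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def multiset_similarity(a, b):
    """Jaccard-style ratio over keyword counts (assumed reading of 'multiset')."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # per-keyword minimum counts
    union = sum((ca | cb).values())  # per-keyword maximum counts
    return inter / union if union else 0.0


def vector_similarity(a, b):
    """Cosine similarity of term-frequency vectors (assumed reading of 'vector')."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Distance between term-frequency vectors; smaller means more similar."""
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[k] - cb[k]) ** 2 for k in ca.keys() | cb.keys()))


def decluster(docs, num_disks, similarity):
    """Hypothetical greedy placement: put each document on the disk whose
    current documents are least similar to it, breaking ties by disk load
    and then by disk index."""
    disks = [[] for _ in range(num_disks)]
    for doc_id, words in docs.items():
        def cost(d):
            total = sum(similarity(words, docs[other]) for other in disks[d])
            return (total, len(disks[d]), d)
        best = min(range(num_disks), key=cost)
        disks[best].append(doc_id)
    return disks


docs = {
    "d1": ["disk", "query"],
    "d2": ["disk", "query"],
    "d3": ["stream"],
    "d4": ["stream"],
}
placement = decluster(docs, 2, set_similarity)
```

In this toy input the two identical pairs end up split across the two disks, so a query on either keyword would read from both disks rather than one. Because `euclidean_distance` is a distance rather than a similarity, it would need to be inverted or negated before being passed to `decluster`.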