Efficient similarity-based declustering techniques for keyword-based information retrieval in the streaming data model

Authors:
Sanjiv Behl; Rakesh M. Verma
Affiliations:
University of Houston-Victoria, Victoria, TX; University of Houston, Houston, TX
Venue:
PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Year:
2007



Abstract

Multiple-disk architectures are an attractive way to meet the high-performance I/O demands of I/O-intensive applications such as search engines, web servers, and information retrieval systems. Exploiting them requires addressing dynamic load balancing and access parallelism, which is the goal of this paper. We address the problem of document declustering in a keyword-based information retrieval system for parallel architectures consisting of a single processor and multiple disks. We propose and experimentally evaluate four similarity-based methods for declustering documents: set, multiset, vector, and Euclidean. Interestingly, our results show that for single-keyword queries as well as Boolean AND queries, the set and multiset methods generally outperform the vector and Euclidean methods, with set performing best under the so-called simple plan. We also introduce a highest-frequency-first retrieval scenario and compare the methods under it; the set and multiset methods remain generally superior to the others, with multiset outperforming set. We compare these methods against the (theoretically) optimal values, which are practically impossible to achieve. Finally, we approximate the multiset method using the harmonic mean and find that the results are slightly inferior to those of the multiset method but still better than those of the vector and Euclidean methods.
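
The abstract names the four similarity-based methods without defining them. As a rough illustration only, the Python sketch below shows one plausible reading: "set" as Jaccard similarity on distinct keywords, "multiset" as Jaccard similarity on keyword frequencies, "vector" as cosine similarity, and "Euclidean" as Euclidean distance between term-frequency vectors. These definitions, the function names, and the greedy placement rule in `decluster` are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

# Documents are modelled as Counters mapping keyword -> frequency.
# All definitions below are illustrative assumptions, not the paper's.

def set_similarity(a, b):
    """Jaccard similarity on the sets of distinct keywords."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def multiset_similarity(a, b):
    """Jaccard similarity on keyword multisets (min / max of counts)."""
    inter = sum((a & b).values())   # Counter & takes per-key minimum
    union = sum((a | b).values())   # Counter | takes per-key maximum
    return inter / union if union else 0.0

def vector_similarity(a, b):
    """Cosine similarity of the term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_distance(a, b):
    """Euclidean distance between term-frequency vectors (smaller = more similar)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))

def decluster(docs, num_disks, similarity):
    """Hypothetical greedy placement: put each document on the disk whose
    current contents are least similar to it, breaking ties by disk load,
    so that similar documents tend to land on different disks."""
    disks = [[] for _ in range(num_disks)]
    for doc_id, doc in docs.items():
        def cost(d):
            total = sum(similarity(doc, docs[other]) for other in disks[d])
            return (total, len(disks[d]))
        best = min(range(num_disks), key=cost)
        disks[best].append(doc_id)
    return disks

docs = {
    "d1": Counter("disk disk query".split()),
    "d2": Counter("disk query".split()),
    "d3": Counter("stream".split()),
}
placement = decluster(docs, num_disks=2, similarity=set_similarity)
```

In this toy run, `d1` and `d2` share every distinct keyword, so the greedy rule places them on different disks; a query matching both could then be served by two disks in parallel, which is the intuition behind spreading similar documents apart.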