Web text retrieval with a P2P query-driven index

Authors:
Gleb Skobeltsyn;Toan Luu;Ivana Podnar Zarko;Martin Rajman;Karl Aberer
Affiliations:
Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland;Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland;University of Zagreb, Zagreb, Croatia;Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland;Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

In this paper, we present a query-driven indexing and retrieval strategy for efficient full-text retrieval from large document collections distributed within a structured P2P network. Our indexing strategy is based on two important properties: (1) the generated distributed index stores posting lists for carefully chosen indexing term combinations, and (2) posting lists containing too many document references are truncated to a bounded number of their top-ranked elements. These two properties guarantee acceptable storage and bandwidth requirements, essentially because the number of indexing term combinations remains scalable and the transmitted posting lists never exceed a constant size. However, because the number of generated term combinations can still become quite large, we also use term statistics extracted from available query logs to index only those combinations that appear frequently in user queries. By avoiding the generation of superfluous indexing term combinations, we achieve an additional substantial reduction in bandwidth and storage consumption. As a result, the generated distributed index is a constantly evolving query-driven indexing structure that efficiently follows the current information needs of the users. More precisely, our theoretical analysis and experimental results indicate that, at the price of a marginal loss in retrieval quality for rare queries, the generated index size and network traffic remain manageable even for web-size document collections. Furthermore, our experiments show that the achieved retrieval quality is fully comparable to that obtained with a state-of-the-art centralized query engine.
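
The two properties described in the abstract, together with the query-log filter, can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the names `MAX_POSTINGS`, `MIN_QUERY_FREQ`, `truncate`, `popular_combinations`, and `build_index` are hypothetical, the index is a plain dictionary rather than a structured P2P overlay, and scoring a term combination by summing per-term scores is an assumption made only for illustration.

```python
from collections import Counter
from itertools import combinations

MAX_POSTINGS = 3    # hypothetical bound on posting-list length
MIN_QUERY_FREQ = 2  # hypothetical popularity threshold from the query log


def truncate(postings, limit=MAX_POSTINGS):
    """Keep only the top-ranked (doc_id, score) pairs, highest score first."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))[:limit]


def popular_combinations(query_log, min_freq=MIN_QUERY_FREQ):
    """Return the term pairs that occur in at least min_freq logged queries."""
    counts = Counter()
    for query in query_log:
        terms = sorted(set(query.split()))
        counts.update(combinations(terms, 2))
    return {combo for combo, n in counts.items() if n >= min_freq}


def build_index(docs, query_log):
    """Build a truncated index keyed by term tuples.

    docs maps doc_id -> {term: score}. Single terms are always indexed;
    term pairs are indexed only if the query log shows they are popular.
    """
    full = {}
    for doc_id, term_scores in docs.items():
        for term, score in term_scores.items():
            full.setdefault((term,), []).append((doc_id, score))
    for combo in popular_combinations(query_log):
        postings = [
            # Assumption: the combined score is the sum of per-term scores.
            (doc_id, sum(term_scores[t] for t in combo))
            for doc_id, term_scores in docs.items()
            if all(t in term_scores for t in combo)
        ]
        if postings:
            full[combo] = postings
    return {key: truncate(postings) for key, postings in full.items()}
```

In this sketch, a pair such as ("index", "p2p") receives its own bounded posting list only when it appears often enough in the log, so rare combinations consume no storage. A distributed version would presumably map each key to the peer responsible for it in the overlay, which this example does not model.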