Federated text retrieval from uncooperative overlapped collections

Authors:
Milad Shokouhi;Justin Zobel
Affiliations:
RMIT University, Melbourne, Australia;RMIT University, Melbourne, Australia
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

In federated text retrieval systems, a query is sent to multiple collections at the same time. The results returned by the collections are gathered and ranked by a central broker, which presents them to the user. It is usually assumed that the collections have little overlap. In practice, however, collections may share many documents as either exact or near duplicates, potentially leading to large numbers of duplicates in the final results. Given the natural bandwidth restrictions and efficiency issues of federated search, sending queries to redundant collections leads to unnecessary costs. We propose a novel method for estimating the rate of overlap among collections based on sampling. Then, using the estimated overlap statistics, we propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results. We show experimentally that, although our estimates of overlap are not exact, our suggested techniques can significantly improve search effectiveness when collections overlap.
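The abstract does not spell out the estimator or the two selection methods, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each collection is represented by a uniform random sample of document fingerprints and that collection sizes are known or estimated. The overlap estimate is a simple capture-recapture style scaling of the sample intersection, and the names `estimate_overlap`, `greedy_select`, and `expected_relevant` are hypothetical.

```python
def estimate_overlap(sample_a, sample_b, size_a, size_b):
    """Estimate how many documents two collections share.

    Assumption (not from the paper): both samples are uniform random
    samples, so a shared document appears in both samples with
    probability (|sample_a| / size_a) * (|sample_b| / size_b).
    """
    sample_a, sample_b = set(sample_a), set(sample_b)
    if not sample_a or not sample_b:
        return 0.0
    shared = len(sample_a & sample_b)
    estimate = shared * (size_a / len(sample_a)) * (size_b / len(sample_b))
    # The overlap cannot exceed the smaller collection.
    return min(estimate, float(min(size_a, size_b)))


def greedy_select(expected_relevant, sizes, samples, k):
    """Pick up to k collections, discounting those that duplicate earlier picks.

    expected_relevant: hypothetical per-collection estimate of relevant
    documents for the query (e.g. from any resource selection method).
    """
    selected = []
    remaining = set(expected_relevant)
    while remaining and len(selected) < k:
        def gain(name):
            # Fraction of this collection assumed to be covered already,
            # taken as the largest estimated overlap with any chosen one.
            covered = 0.0
            for chosen in selected:
                shared = estimate_overlap(
                    samples[name], samples[chosen], sizes[name], sizes[chosen]
                )
                covered = max(covered, shared / sizes[name])
            return expected_relevant[name] * (1.0 - covered)

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a collection that appears to duplicate an already selected one contributes almost nothing to the expected number of unique relevant documents, so the broker skips it in favour of a less redundant collection.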