Pruning long documents for distributed information retrieval

Authors:
Jie Lu;Jamie Callan
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

References (8)
Cited by (14)

Searching distributed collections with inference networks
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval

Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)

Applying summarization techniques for term selection in relevance feedback
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Static index pruning for information retrieval systems
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Generic summaries for indexing in information retrieval
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

The effectiveness of query expansion for distributed information retrieval
Proceedings of the tenth international conference on Information and knowledge management

Using sampled data and regression to merge search engine results
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval

Server Ranking for Distributed Text Retrieval Systems on the Internet
Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA)

Content-based retrieval in hybrid peer-to-peer networks
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

Evaluating profiling and query expansion methods for P2P information retrieval
Proceedings of the 2005 ACM workshop on Information retrieval in peer-to-peer networks

Reducing storage costs for federated search of text databases
dg.o '03 Proceedings of the 2003 annual national conference on Digital government research

Collaborative research - digital government: a language modeling approach to metadata for cross-database linkage and search
dg.o '04 Proceedings of the 2004 annual national conference on Digital government research

An evaluation of resource description quality measures
Proceedings of the 2006 ACM symposium on Applied computing

Towards better measures: evaluation of estimated resource description quality for distributed IR
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems

Search and browse services for heterogeneous collections with the peer-to-peer network Pepper
Information Processing and Management: an International Journal

Using query logs to establish vocabularies in distributed information retrieval
Information Processing and Management: an International Journal

Metadata harvesting for content-based distributed information retrieval
Journal of the American Society for Information Science and Technology

Ranking information resources in peer-to-peer text retrieval: an experimental study
Proceedings of the 2008 ACM workshop on Large-Scale distributed systems for information retrieval

Robust result merging using sample-based score estimates
ACM Transactions on Information Systems (TOIS)

Document Compaction for Efficient Query Biased Snippet Generation
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval

Caching query-biased snippets for efficient retrieval
Proceedings of the 14th International Conference on Extending Database Technology

Federated Search
Foundations and Trends in Information Retrieval

Abstract

Query-based sampling is a method of discovering the contents of a text database by submitting queries to a search engine and observing the documents returned. In prior research, sampled documents were used to build resource descriptions for automatic database selection, and to build a centralized sample database for query expansion and result merging. An unstated assumption was that the associated storage costs were acceptable. When sampled documents are long, storage costs can be large. This paper investigates methods of pruning long documents to reduce storage costs. The experimental results demonstrate that building resource descriptions and centralized sample databases from the pruned contents of sampled documents can reduce storage costs by 54-93% while causing only minor losses in the accuracy of distributed information retrieval.
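
The sampling loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `search` function, the `DATABASE` contents, the random choice of the next query term, and in particular the `prune` rule, which simply keeps the first `max_terms` terms of a document. The abstract does not say which pruning methods the paper evaluates, so this rule is a hypothetical stand-in and not the authors' method.

```python
import random
from collections import Counter

# Hypothetical toy database; a real system would query a remote search engine.
DATABASE = [
    "distributed retrieval selects databases using resource descriptions",
    "query based sampling submits queries and observes returned documents",
    "long documents increase storage costs for sampled databases",
    "pruning documents reduces storage while keeping retrieval accuracy",
]

def search(database, query, top_n=4):
    """Toy search engine: return up to top_n documents containing the query term."""
    return [doc for doc in database if query in doc.split()][:top_n]

def prune(doc, max_terms):
    """Hypothetical pruning rule (an assumption): keep only the first max_terms terms."""
    return " ".join(doc.split()[:max_terms])

def storage_cost(docs):
    """Storage cost measured as the total number of stored terms."""
    return sum(len(doc.split()) for doc in docs)

def query_based_sample(database, seed_term, num_queries, max_terms=None, rng=None):
    """Sample documents by repeatedly querying with terms seen in earlier results."""
    rng = rng or random.Random(0)
    sampled = []            # documents as stored (pruned if max_terms is given)
    seen = set()            # original documents already sampled
    vocabulary = Counter()  # term frequencies: a simple resource description
    query = seed_term
    for _ in range(num_queries):
        for doc in search(database, query):
            if doc in seen:
                continue
            seen.add(doc)
            stored = prune(doc, max_terms) if max_terms is not None else doc
            sampled.append(stored)
            vocabulary.update(stored.split())
        if not vocabulary:
            break  # the seed query returned nothing, so there is no next query term
        # Pick the next query term from what has been learned so far.
        query = rng.choice(sorted(vocabulary))
    return sampled, vocabulary
```

Calling `query_based_sample(DATABASE, "documents", num_queries=5, max_terms=3)` returns the pruned sample and its term counts, and `storage_cost` applied to the pruned and unpruned samples gives a rough sense of the storage trade-off in this toy setting.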