Global term weights in distributed environments

Authors:
Hans Friedrich Witschel
Affiliations:
University of Leipzig, NLP Department, P.O. Box 100920, D-04009 Leipzig, Germany
Venue:
Information Processing and Management: an International Journal
Year:
2008



Abstract

This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view of the collection is not available. In particular, two options are compared using standard IR test collections: sampling documents, and using a reference corpus that is independent of the target retrieval collection. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection (an "extended stop word list") are known and all terms that are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus; some "domain-specific stop words" need to be added. A good solution is to mix estimates from small samples of the target retrieval collection with estimates derived from a reference corpus.
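
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives neither the exact weighting formula nor the mixing scheme, so the linear interpolation, the function names, and the parameters `weight`, `list_size` and `default_relative_df` are all assumptions made for illustration. The sketch mixes relative document frequencies from a small sample of the target collection with those from a reference corpus, keeps an IDF-style weight only for the most frequent terms, and gives every other term the same default weight.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def mixed_relative_df(sample_docs, reference_docs, weight=0.5):
    """Interpolate relative document frequencies from two sources.

    Assumption: a simple linear mix; `weight` is the share given to the sample.
    """
    if not sample_docs or not reference_docs:
        raise ValueError("both document sets must be non-empty")
    sample_df = document_frequencies(sample_docs)
    ref_df = document_frequencies(reference_docs)
    n_sample, n_ref = len(sample_docs), len(reference_docs)
    mixed = {}
    for term in set(sample_df) | set(ref_df):
        p = weight * sample_df[term] / n_sample + (1 - weight) * ref_df[term] / n_ref
        if p > 0:  # skip terms whose mixed estimate is zero (weight of 0 or 1)
            mixed[term] = p
    return mixed


def pruned_idf(relative_df, list_size, default_relative_df):
    """Keep weights for the `list_size` most frequent terms only.

    All other terms share one default weight (hypothetical parameter).
    """
    top = sorted(relative_df.items(), key=lambda kv: (-kv[1], kv[0]))[:list_size]
    table = {term: -math.log(p) for term, p in top}
    default = -math.log(default_relative_df)
    return lambda term: table.get(term, default)


# Toy data: "gene" is frequent in the target sample but absent from the reference corpus.
sample = [["the", "gene", "protein"], ["the", "gene", "cell"]]
reference = [["the", "market"], ["the", "a"], ["a", "stock"], ["the", "price"]]

idf = pruned_idf(mixed_relative_df(sample, reference), list_size=2, default_relative_df=0.01)
for term in ("the", "gene", "protein", "stock"):
    print(term, round(idf(term), 3))
```

In this toy data, "gene" plays the role of a domain-specific stop word: a reference corpus alone would never place it in the frequent-term list, whereas the mixed estimate does. Terms outside the list, such as "protein" and "stock", receive the same weight.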