Scaling Up the TREC Collection

Authors:
David Hawking;Paul Thistlewaite;Donna Harman
Affiliations:
Cooperative Research Centre For Advanced Computational Systems Department Of Computer Science, Australian National University, Canberra ACT 0200 Australia. david.hawking@cmis.csiro.au;Cooperative Research Centre For Advanced Computational Systems Department Of Computer Science, Australian National University, Canberra ACT 0200 Australia. paul.thistlewaite@cs.anu.edu.au;National Institute of Standards and Technology Gaithersburg MD 20899. donna.harman@nist.gov
Venue:
Information Retrieval
Year:
1999

Abstract

Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures has been defined to encourage comparability of reporting. The 20 gigabyte first edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100 gigabyte second edition VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.