Scaling Up the TREC Collection

  • Authors:
  • David Hawking; Paul Thistlewaite; Donna Harman

  • Affiliations:
  • Cooperative Research Centre For Advanced Computational Systems, Department Of Computer Science, Australian National University, Canberra ACT 0200 Australia. david.hawking@cmis.csiro.au; Cooperative Research Centre For Advanced Computational Systems, Department Of Computer Science, Australian National University, Canberra ACT 0200 Australia. paul.thistlewaite@cs.anu.edu.au; National Institute of Standards and Technology, Gaithersburg MD 20899. donna.harman@nist.gov

  • Venue:
  • Information Retrieval
  • Year:
  • 1999

Abstract

Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures have been defined to encourage comparability of reporting. The 20 gigabyte first-edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100 gigabyte second edition VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.
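
As an illustration of the "early precision" idea mentioned in the abstract, the following minimal Python sketch computes precision over the top k documents of a ranked result list. The function name `precision_at_k`, the cutoff `k`, and the document identifiers are hypothetical and chosen for illustration; the abstract does not state which cutoff the track used. The sketch only shows why such a measure needs relevance judgments for the top-ranked documents alone, rather than for the whole collection.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Return the fraction of the top-k ranked documents judged relevant.

    ranked_doc_ids: list of document ids, best-ranked first (hypothetical data).
    relevant_doc_ids: set of ids judged relevant among the assessed documents.
    k: rank cutoff; an assumption here, not a value taken from the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Divide by k (not by len(top_k)) so that short result lists are penalised.
    return hits / k


# Hypothetical ranking and judgments for a single query.
ranking = ["d3", "d7", "d1", "d9", "d4"]
judged_relevant = {"d3", "d1", "d4"}

p_at_5 = precision_at_k(ranking, judged_relevant, 5)
p_at_2 = precision_at_k(ranking, judged_relevant, 2)
```

Because only the first k documents of each run enter the calculation, assessors need to judge only those documents, which is consistent with the abstract's point about avoiding complete relevance assessments.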