Do TREC web collections look like the web?

Authors:
Ian Soboroff
Affiliations:
National Institute of Standards and Technology, Gaithersburg, MD
Venue:
ACM SIGIR Forum
Year:
2002

Citing 3

Graph structure in the Web
Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking

A case study in web search using TREC algorithms
Proceedings of the 10th international conference on World Wide Web

Engineering a multi-purpose test collection for web retrieval experiments
Information Processing and Management: an International Journal

Cited 9

Documents and queries as random variables: History and implications: Research Articles
Journal of the American Society for Information Science and Technology

Navigation-aided retrieval
Proceedings of the 16th international conference on World Wide Web

Using similarity links as shortcuts to relevant web pages
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Test theory for evaluating reliability of IR test collections
Information Processing and Management: an International Journal

Is Wikipedia link structure different?
Proceedings of the Second ACM International Conference on Web Search and Data Mining

Correlation of Term Count and Document Frequency for Google N-Grams
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval

Relevance propagation model for large hypertext document collections
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)

A systematic study of parameter correlations in large scale duplicate document detection
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

A path-based approach for web page retrieval
World Wide Web

Abstract

We measure the WT10g test collection, used in the TREC-9 and TREC 2001 Web Tracks, and the .GOV test collection used in the TREC 2002 Web and Interactive Tracks, with common measures used in the web topology community, in order to see if these collections "look like" the web. This is not an idle question; characteristics of the web, such as power law relationships, diameter, and connected components have all been observed within the scope of general web crawls, constructed by blindly following links. The .GOV collection is a fairly straightforward 18GB crawl of sites in the .gov domain. In contrast, WT10g was carved out from a much larger crawl specifically to be a web search test collection within the reach of university researchers. Do such collections retain the properties of the larger web? In the case of WT10g and .GOV, yes.
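
As a rough illustration of two of the measures the abstract names (the in-degree distribution, which is where a power law would show up, and connected components), the following is a minimal Python sketch. It is not the paper's code: the function names and the toy `edges` list are hypothetical, and a real collection would need its link graph extracted first.

```python
from collections import Counter, defaultdict, deque


def in_degree_distribution(edges):
    """Map each in-degree value to the number of pages that have it.

    Pages with no incoming links are not counted in this sketch.
    """
    indeg = Counter(dst for _, dst in edges)
    return Counter(indeg.values())


def weakly_connected_components(edges):
    """Return the components of the link graph, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from each page not yet assigned to a component.
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical toy link graph: (source page, target page).
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("e", "f")]

distribution = in_degree_distribution(edges)
components = weakly_connected_components(edges)
```

On a real crawl, one would typically plot the distribution on log-log axes to check for an approximately straight line, and compare the size of the largest component with the total number of pages.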