Measuring the complexity of a collection of documents

  • Authors: Vishwa Vinay; Ingemar J. Cox; Natasa Milic-Frayling; Ken Wood

  • Affiliations: Department of Computer Science, University College London, UK; Department of Computer Science, University College London, UK; Microsoft Research Ltd, Cambridge, UK; Microsoft Research Ltd, Cambridge, UK

  • Venue: ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
  • Year: 2006

Abstract

Some text collections are more difficult to search, or more complex to organize into topics, than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters, and we compute this quantity for document collections represented as sets of term vectors. We consider three applications of the statistic: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query result set. Our experimental results show a correlation between observed effectiveness and this statistic, demonstrating the utility of such data analysis in text retrieval.
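
To make the idea of a clustering-tendency measure concrete, the following is a minimal sketch of a nearest-neighbour ratio in the spirit of the Cox-Lewis statistic. It is not the specific variation used in the paper, which the abstract does not spell out. The function name `nn_ratio_statistic`, the uniform sampling of probe points inside the data's bounding box, the Euclidean distance, and the averaging of ratios are all assumptions made for illustration.

```python
import numpy as np

def nn_ratio_statistic(points, n_samples=200, seed=0):
    """Hypothetical clustering-tendency score (illustrative only).

    For each random probe location, compare the distance from the probe
    to its nearest data point with the distance from that data point to
    its own nearest neighbour. Larger average ratios suggest that the
    data sit in tight groups separated by empty space.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Assumption: probes are drawn uniformly from the bounding box of the data.
    lo, hi = points.min(axis=0), points.max(axis=0)
    probes = rng.uniform(lo, hi, size=(n_samples, points.shape[1]))

    ratios = []
    for probe in probes:
        # Distance from the probe to every data point (Euclidean, by assumption).
        d_probe = np.linalg.norm(points - probe, axis=1)
        i = int(np.argmin(d_probe))

        # Distance from the selected data point to its nearest other data point.
        d_data = np.linalg.norm(points - points[i], axis=1)
        d_data[i] = np.inf  # exclude the point itself
        d_nn = d_data.min()

        if d_nn > 0:  # skip exact duplicates to avoid division by zero
            ratios.append(d_probe[i] / d_nn)

    return float(np.mean(ratios))
```

Under these assumptions, a collection whose term vectors form tight, well-separated groups should score higher than one whose vectors are spread evenly, so the score could be compared across collections or across representations of the same collection. For real term vectors, a cosine-based distance may be a more natural choice than the Euclidean distance used here.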