Measuring the complexity of a collection of documents

Authors: Vishwa Vinay; Ingemar J. Cox; Natasa Milic-Frayling; Ken Wood
Affiliations: Department of Computer Science, University College London, UK; Department of Computer Science, University College London, UK; Microsoft Research Ltd, Cambridge, UK; Microsoft Research Ltd, Cambridge, UK
Venue: ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Year: 2006

Abstract

Some text collections are more difficult to search, or more complex to organize into topics, than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters, and we compute this quantity for document collections represented as sets of term vectors. We consider applications of the statistic in three scenarios: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query's result set. Our experimental results show a correlation between observed effectiveness and this statistic, demonstrating the utility of such data analysis in text retrieval.
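
As a rough illustration of the kind of measurement the abstract describes, the sketch below computes a simplified Cox-Lewis-style ratio in Python. It is not the authors' variation, which the abstract does not specify. The function names, the uniform sampling over the bounding box, the Euclidean distance, and the toy 2-D data are all assumptions made for this example. For each random sampling point, it takes the distance to the nearest data point and divides it by the distance from that data point to its own nearest neighbour. Larger average ratios suggest a stronger tendency to cluster.

```python
import math
import random


def nearest(point, points, skip=None):
    """Return (index, distance) of the member of `points` closest to `point`."""
    best_i, best_d = -1, math.inf
    for i, p in enumerate(points):
        if i == skip:
            continue
        d = math.dist(point, p)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def cox_lewis_style_ratio(points, n_samples=200, seed=0):
    """Mean ratio of sample-to-nearest-point distance over that point's
    nearest-neighbour distance (hypothetical simplification)."""
    rng = random.Random(seed)
    dims = len(points[0])
    # Assumption: sampling points are drawn uniformly from the bounding box.
    lo = [min(p[k] for p in points) for k in range(dims)]
    hi = [max(p[k] for p in points) for k in range(dims)]
    ratios = []
    for _ in range(n_samples):
        sample = [rng.uniform(lo[k], hi[k]) for k in range(dims)]
        i, d_sample = nearest(sample, points)
        _, d_neighbour = nearest(points[i], points, skip=i)
        if d_neighbour > 0:  # ignore exact duplicate points
            ratios.append(d_sample / d_neighbour)
    return sum(ratios) / len(ratios)


def demo():
    """Compare uniformly scattered points with two tight clusters (toy data)."""
    rng = random.Random(1)
    uniform = [[rng.random(), rng.random()] for _ in range(200)]
    centres = [(0.2, 0.2), (0.8, 0.8)]
    clustered = [
        [rng.gauss(cx, 0.02), rng.gauss(cy, 0.02)]
        for cx, cy in centres
        for _ in range(100)
    ]
    return cox_lewis_style_ratio(uniform), cox_lewis_style_ratio(clustered)


if __name__ == "__main__":
    u, c = demo()
    print(f"uniform: {u:.2f}  clustered: {c:.2f}")
```

In a document setting, each point would be a term vector, and the distance function would probably be swapped for one suited to sparse, high-dimensional data. That substitution is left open here because the abstract does not say which measure was used.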