Some text collections are more difficult to search or more complex to organize into topics than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters. We compute this quantity for document collections that are represented as sets of term vectors. We consider applications of the Cox-Lewis statistic in three scenarios: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query result set. Our experimental results show a correlation between the observed retrieval effectiveness and this statistic, thereby demonstrating the utility of such data analysis in text retrieval.
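To make the idea of a clustering-tendency ratio concrete, the following is a minimal sketch of a Cox-Lewis-style statistic, not the variation used in this work. It assumes Euclidean distance, draws sampling origins uniformly from the bounding box of the data, and averages the ratio of two nearest-neighbour distances. The function names `nearest` and `cox_lewis_ratio`, the parameter `n_samples`, and the bounding-box sampling window are all illustrative assumptions.

```python
import math
import random


def nearest(point, candidates):
    """Return (distance, index) of the candidate closest to point."""
    best_d, best_i = math.inf, -1
    for i, c in enumerate(candidates):
        d = math.dist(point, c)  # Euclidean distance (an assumption)
        if d < best_d:
            best_d, best_i = d, i
    return best_d, best_i


def cox_lewis_ratio(points, n_samples=200, seed=0):
    """Mean of d1/d2 over random sampling origins (hypothetical helper).

    d1: distance from a random origin to its nearest data point.
    d2: distance from that data point to its own nearest data neighbour.
    points is a list of equal-length coordinate sequences.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    dims = len(points[0])
    # Sampling window: the axis-aligned bounding box of the data (assumption).
    lows = [min(p[k] for p in points) for k in range(dims)]
    highs = [max(p[k] for p in points) for k in range(dims)]
    ratios = []
    for _ in range(n_samples):
        origin = [rng.uniform(lows[k], highs[k]) for k in range(dims)]
        d1, i = nearest(origin, points)
        others = points[:i] + points[i + 1:]
        d2, _ = nearest(points[i], others)
        if d2 > 0:  # skip exact duplicates to avoid division by zero
            ratios.append(d1 / d2)
    return sum(ratios) / len(ratios) if ratios else math.nan
```

Under these assumptions, tightly clustered data tends to give a larger ratio than uniformly scattered data, because random origins usually fall in the empty space between clusters (large d1) while points inside a cluster sit close to their neighbours (small d2). For sparse, high-dimensional term vectors a different distance and sampling scheme would likely be needed.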