Scatter/Gather: a cluster-based approach to browsing large document collections
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Reexamining the cluster hypothesis: scatter/gather on retrieval results
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
CURE: an efficient clustering algorithm for large databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Distributional clustering of words for text classification
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The WebCluster project. Using clustering for mediating access to the World Wide Web
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
ACM Computing Surveys (CSUR)
Document clustering using word clusters via the information bottleneck method
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding out about: a cognitive perspective on search engine technology and the WWW
Finding out about: a cognitive perspective on search engine technology and the WWW
On feature distributional clustering for text categorization
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Modern Information Retrieval
Unsupervised document classification using sequential information maximization
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Distributional word clusters vs. words for text categorization
The Journal of Machine Learning Research
A divisive information theoretic feature clustering algorithm for text classification
The Journal of Machine Learning Research
Distributional clustering of English words
ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research
Cluster-based retrieval using language models
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection
IEEE Transactions on Information Technology in Biomedicine
A relevance feedback mechanism for cluster-based retrieval
Information Processing and Management: an International Journal
Data weaving: scaling up the state-of-the-art in data clustering
Proceedings of the 17th ACM conference on Information and knowledge management
Expert Systems with Applications: An International Journal
An efficient clustering algorithm for large-scale topical web pages
Proceedings of the 18th ACM conference on Information and knowledge management
Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection
IEEE Transactions on Neural Networks
A coarse-to-fine framework to efficiently thwart plagiarism
Pattern Recognition
A multi-level matching method with hybrid similarity for document retrieval
Expert Systems with Applications: An International Journal
Hi-index | 0.00 |
In this paper, the scalability and quality of the contextual document clustering (CDC) approach is demonstrated for large data-sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors, the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scaleable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.