Scatter/Gather: a cluster-based approach to browsing large document collections
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Incremental clustering for dynamic information processing
ACM Transactions on Information Systems (TOIS)
OHSUMED: an interactive retrieval evaluation and new large test collection for research
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Incremental clustering for very large document databases: initial MARIAN experience
Information Sciences—Informatics and Computer Science: An International Journal
Web document clustering: a feasibility demonstration
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
A language modeling approach to information retrieval
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Information Retrieval
Modern Information Retrieval
The effectiveness of query-specific hierarchic clustering in information retrieval
Information Processing and Management: an International Journal
On the quality of ART1 text clustering
Neural Networks - 2003 Special issue: Advances in neural networks research IJCNN'03
RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research
Cluster-based retrieval using language models
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Efficient Phrase-Based Document Indexing for Web Document Clustering
IEEE Transactions on Knowledge and Data Engineering
Sophia: a novel approach for textual case-based reasoning
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection
IEEE Transactions on Information Technology in Biomedicine
Hi-index | 0.00 |
In this article, we assess the effectiveness of Contextual Document Clustering (CDC) as a means of indexing within a dynamic and rapidly changing environment. We simulate a dynamic environment, by splitting two chronologically ordered datasets into time-ordered segments and assessing how the technique performs under two different scenarios. The first is when new documents are added incrementally without reclustering [incremental CDC (iCDC)], and the second is when reclustering is performed [nonincremental CDC (nCDC)]. The datasets are very large, are independent of each other, and belong to two very different domains. We show that CDC itself is effective at clustering very large document corpora, and that, significantly, it lends itself to a very simple, efficient incremental document addition process that is seen to be very stable over time despite the size of the corpus growing considerably. It was seen to be effective at incrementally clustering new documents even when the corpus grew to six times its original size. This is in contrast to what other researchers have found when applying similar simple incremental approaches to document clustering. The stability of iCDC is accounted for by the unique manner in which CDC discovers cluster themes. © 2008 Wiley Periodicals, Inc.