An investigation into the stability of contextual document clustering

Authors:
Niall Rooney;David Patterson;Mykola Galushka;Vladimir Dobrynin;Elena Smirnova
Affiliations:
Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey BT37 0QB, United Kingdom;Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey BT37 0QB, United Kingdom;Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey BT37 0QB, United Kingdom;Faculty of Applied Mathematics & Control Processes, St. Petersburg State University, 35 University Ave., St. Petersburg 198504, Russia;Faculty of Applied Mathematics & Control Processes, St. Petersburg State University, 35 University Ave., St. Petersburg 198504, Russia
Venue:
Journal of the American Society for Information Science and Technology
Year:
2008

Abstract

In this article, we assess the effectiveness of Contextual Document Clustering (CDC) as a means of indexing within a dynamic and rapidly changing environment. We simulate such an environment by splitting two chronologically ordered datasets into time-ordered segments and assessing how the technique performs under two scenarios. In the first, new documents are added incrementally without reclustering [incremental CDC (iCDC)]; in the second, reclustering is performed [nonincremental CDC (nCDC)]. The datasets are very large, independent of each other, and drawn from two very different domains. We show that CDC is effective at clustering very large document corpora and that, significantly, it lends itself to a simple, efficient incremental document addition process that remains stable over time even as the corpus grows considerably: it clustered new documents effectively even when the corpus had grown to six times its original size. This contrasts with what other researchers have found when applying similarly simple incremental approaches to document clustering. The stability of iCDC is accounted for by the unique manner in which CDC discovers cluster themes. © 2008 Wiley Periodicals, Inc.
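
The evaluation protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' CDC implementation: the abstract does not specify how cluster themes are discovered or how documents are matched to them, so the sketch assumes that themes are plain term-count vectors and that similarity is cosine similarity. The names `split_into_segments`, `cosine`, and `assign_incrementally` are hypothetical. The sketch shows only the two ideas stated above: cutting a chronologically ordered corpus into time-ordered segments, and adding new documents to fixed themes without reclustering.

```python
from collections import Counter
from math import sqrt


def split_into_segments(docs, n_segments):
    """Split a chronologically ordered list into contiguous, time-ordered segments."""
    size, extra = divmod(len(docs), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments receive one additional document.
        end = start + size + (1 if i < extra else 0)
        segments.append(docs[start:end])
        start = end
    return segments


def cosine(a, b):
    """Cosine similarity between two term-count vectors (an assumed stand-in measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_incrementally(themes, segment):
    """Assign each new document to its most similar existing theme.

    The themes are never updated, which mirrors the 'no reclustering'
    scenario; a nonincremental run would rebuild the themes instead.
    """
    assignments = {}
    for doc_id, text in segment:
        vec = Counter(text.lower().split())
        assignments[doc_id] = max(themes, key=lambda name: cosine(vec, themes[name]))
    return assignments
```

In this sketch, the themes would be built once from the first segment and then reused for every later segment, whereas the nonincremental scenario would rebuild them each time the corpus grows.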