A scaleable document clustering approach for large document corpora

Authors:
Niall Rooney;David Patterson;Mykola Galushka;Vladimir Dobrynin
Affiliations:
Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK;Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK;Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK;Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, St. Petersburg, Russia
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large datasets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering that automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scaleable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge, this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.
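
The two quality factors described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not name a distance measure, so the Jensen-Shannon divergence between term distributions is an assumption here, as are all function names, the toy documents, the category labels, and the choice of document 0 as the representative.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two term distributions.
    ASSUMPTION: the abstract does not state which distance CDC uses."""
    terms = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of terms.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def minimum_spanning_tree(n, dist):
    """Prim's algorithm on the complete graph over n documents.
    Returns a list of (i, j) index pairs, one per tree edge."""
    if n == 0:
        return []
    # best[j] = (cheapest known distance from the tree to j, tree node giving it)
    best = {j: (dist(0, j), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        for k in best:
            d = dist(j, k)
            if d < best[k][0]:
                best[k] = (d, j)
    return edges

def adjacent_category_overlap(edges, categories):
    """Fraction of tree edges whose two documents share at least one category."""
    if not edges:
        return 0.0
    shared = sum(1 for i, j in edges if categories[i] & categories[j])
    return shared / len(edges)

def representative_agreement(rep, categories):
    """Fraction of the other documents sharing a category with document `rep`."""
    others = [c for k, c in enumerate(categories) if k != rep]
    if not others:
        return 0.0
    return sum(1 for c in others if c & categories[rep]) / len(others)

# Hypothetical toy cluster: term distributions and category sets per document.
docs = [
    {"oil": 0.5, "price": 0.5},
    {"oil": 0.4, "price": 0.4, "opec": 0.2},
    {"bank": 0.5, "rate": 0.5},
    {"bank": 0.6, "rate": 0.3, "loan": 0.1},
]
cats = [{"energy"}, {"energy", "markets"}, {"finance"}, {"finance"}]

tree = minimum_spanning_tree(len(docs), lambda i, j: js_divergence(docs[i], docs[j]))
overlap = adjacent_category_overlap(tree, cats)
agreement = representative_agreement(0, cats)
print(tree, overlap, agreement)
```

In this toy cluster the tree must contain one edge joining the energy documents to the finance documents, so two of its three edges connect documents with a shared category. A real evaluation would use the RCV1 category sets and whatever representative-selection rule the paper defines.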