A scaleable document clustering approach for large document corpora

  • Authors:
  • Niall Rooney; David Patterson; Mykola Galushka; Vladimir Dobrynin

  • Affiliations:
  • Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK; Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK; Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey, UK; Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, St. Petersburg, Russia

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2006

Abstract

In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large datasets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering that automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge, this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.
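
The idea of arranging a cluster's documents in a minimum spanning tree and then measuring the category overlap between adjacent documents can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distance measure, so cosine distance over term counts is assumed here, and the function names (`cosine_distance`, `minimum_spanning_tree`, `adjacent_category_overlap`), the toy documents, and the category labels are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-count dicts (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def minimum_spanning_tree(docs):
    """Prim's algorithm over all document pairs; returns (i, j) index edges."""
    if not docs:
        return []
    # best[j] = (distance, i): cheapest known link from the tree to document j
    best = {j: (cosine_distance(docs[0], docs[j]), 0) for j in range(1, len(docs))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        # Document j is now in the tree; see whether it offers cheaper links.
        for k in best:
            d = cosine_distance(docs[j], docs[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def adjacent_category_overlap(edges, categories):
    """Fraction of tree edges whose two documents share at least one category."""
    if not edges:
        return 0.0
    shared = sum(1 for i, j in edges if categories[i] & categories[j])
    return shared / len(edges)


# Hypothetical toy cluster: term counts and category sets for four documents.
docs = [
    {"oil": 3, "price": 2},
    {"oil": 2, "barrel": 1},
    {"bank": 3, "rate": 2},
    {"bank": 1, "rate": 3, "price": 1},
]
categories = [{"energy"}, {"energy"}, {"finance"}, {"finance", "markets"}]

edges = minimum_spanning_tree(docs)
overlap = adjacent_category_overlap(edges, categories)
print(edges, round(overlap, 3))
```

A higher overlap value suggests that documents placed next to each other in the tree tend to be topically related, which is the kind of property the first quality factor in the abstract is meant to capture. Prim's algorithm is used here only for brevity; for a collection of RCV1's size a sparse-graph routine would be a more realistic choice.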