Semantically-guided clustering of text documents via frequent subgraphs discovery

  • Authors:
  • Rafal A. Angryk; M. Shahriar Hossain; Brandon Norick

  • Affiliations:
  • Department of Computer Science, Montana State University, Bozeman, MT; Department of Computer Science, Virginia Tech, Blacksburg, VA; Department of Computer Science, Montana State University, Bozeman, MT

  • Venue:
  • ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
  • Year:
  • 2011

Abstract

This paper introduces and analyzes two improvements to GDClust [1], a system for document clustering based on the co-occurrence of frequent subgraphs. GDClust (Graph-Based Document Clustering) works with frequent senses derived from the constraints of natural language, rather than with the co-occurrences of frequent keywords commonly used in the vector space model (VSM) of document clustering. Text documents are transformed into hierarchical document-graphs, and an efficient graph-mining technique is used to find frequent subgraphs. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. The two novel mechanisms, the Subgraph-Extension Generator (SEG) and the Maximum Subgraph-Extension Generator (MaxSEG), directly use constraints from natural language to reduce the number of candidates and the overhead imposed by our first implementation of GDClust.
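
The pipeline described in the abstract (document-graphs, frequent subgraph mining, sense-based comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation and not SEG or MaxSEG themselves. The `HYPERNYM` hierarchy is a hypothetical toy stand-in for a real lexical resource, the level-wise "extend by one adjacent frequent edge" step is only a simplified take on candidate generation by subgraph extension, and the Jaccard similarity is an assumption chosen for brevity.

```python
# Hypothetical toy hierarchy (child -> parent); an assumption for illustration only.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal",
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
}

def document_graph(words):
    """Edge set linking each known word up to its root in the toy hierarchy."""
    edges = set()
    for word in words:
        node = word
        while node in HYPERNYM:
            edges.add((node, HYPERNYM[node]))
            node = HYPERNYM[node]
    return frozenset(edges)

def support(subgraph, graphs):
    """Number of document-graphs that contain every edge of the subgraph."""
    return sum(1 for g in graphs if subgraph <= g)

def frequent_subgraphs(graphs, min_support):
    """Level-wise mining: grow each frequent subgraph by one frequent, adjacent edge."""
    all_edges = set().union(*graphs)
    frequent_edges = {
        e for e in all_edges if support(frozenset([e]), graphs) >= min_support
    }
    level = {frozenset([e]) for e in frequent_edges}
    found = set(level)
    while level:
        candidates = set()
        for sub in level:
            nodes = {n for edge in sub for n in edge}
            for edge in frequent_edges - sub:
                # Only edges touching the subgraph keep the candidate connected.
                if edge[0] in nodes or edge[1] in nodes:
                    candidates.add(sub | {edge})
        level = {c for c in candidates if support(c, graphs) >= min_support}
        found |= level
    return found

def similarity(g1, g2, subgraphs):
    """Jaccard similarity over the frequent subgraphs each document contains."""
    s1 = {s for s in subgraphs if s <= g1}
    s2 = {s for s in subgraphs if s <= g2}
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

docs = [["dog", "wolf"], ["dog", "cat"], ["car", "truck"], ["truck", "car"]]
graphs = [document_graph(d) for d in docs]
subgraphs = frequent_subgraphs(graphs, min_support=2)
print(len(subgraphs), similarity(graphs[0], graphs[1], subgraphs))
```

In this sketch, restricting extensions to edges that are already frequent and adjacent to the current subgraph is what keeps the candidate set small. Documents that share many frequent subgraphs score as similar even when they share few literal keywords.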