Semantically-guided clustering of text documents via frequent subgraphs discovery

Authors:
Rafal A. Angryk;M. Shahriar Hossain;Brandon Norick
Affiliations:
Department of Computer Science, Montana State University, Bozeman, MT;Department of Computer Science, Virginia Tech, Blacksburg, VA;Department of Computer Science, Montana State University, Bozeman, MT
Venue:
ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
Year:
2011

Citing 15
Cited 1

Optimizing convenient online access to bibliographic databases

Information Services and Use
BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Discovery of Multiple-Level Association Rules from Large Databases

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Applying the Subdue Substructure Discovery System to the Chemical Toxicity Domain

Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference
gSpan: Graph-Based Substructure Pattern Mining

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Diagonally Subgraphs Pattern Mining

Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
A study of topic similarity measures

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
An Efficient Algorithm for Discovering Frequent Subgraphs

IEEE Transactions on Knowledge and Data Engineering
Introduction to Data Mining, (First Edition)

Introduction to Data Mining, (First Edition)
Subdue: compression-based frequent pattern discovery in graph data

Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations
YALE: rapid prototyping for complex data mining tasks

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
GDClust: A Graph-Based Document Clustering Technique

ICDMW '07 Proceedings of the Seventh IEEE International Conference on Data Mining Workshops
Knowledge map creation and maintenance for virtual communities of practice

Information Processing and Management: an International Journal

Abstracting for Dimensionality Reduction in Text Classification

International Journal of Intelligent Systems

Abstract

This paper introduces and analyzes two improvements to GDClust [1] (Graph-Based Document Clustering), a system that clusters documents based on the co-occurrence of frequent subgraphs. Rather than relying on the co-occurrence of frequent keywords, as is common in the vector space model (VSM) of document clustering, GDClust works with frequent senses derived from the constraints of natural language. Text documents are transformed into hierarchical document-graphs, and an efficient graph-mining technique is used to find frequent subgraphs. The discovered frequent subgraphs are then used to generate accurate sense-based document clusters. The two novel mechanisms, the Subgraph-Extension Generator (SEG) and the Maximum Subgraph-Extension Generator (MaxSEG), directly utilize constraints from natural language to reduce the number of candidates and the overhead imposed by our first implementation of GDClust.
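
The pipeline outlined in the abstract (document-graphs, frequent subgraph mining, clustering by shared subgraphs) can be illustrated with a small sketch. This is not the paper's SEG or MaxSEG algorithm, whose details are not given here. It only shows the general idea of growing frequent subgraphs one edge at a time while restricting candidates to edges that attach to the current subgraph. The toy hierarchy, the function names (`support`, `touches`, `mine_frequent_subgraphs`, `jaccard`), and the assumption that every document-graph is a set of uniquely labeled `(child, parent)` edges drawn from one shared hierarchy are all hypothetical choices made for illustration. Under that assumption, checking whether a subgraph occurs in a document reduces to a subset test.

```python
from itertools import chain


def support(subgraph, docs):
    """Number of documents whose edge set contains every edge of `subgraph`."""
    return sum(1 for edges in docs.values() if subgraph <= edges)


def touches(edge, subgraph):
    """True if `edge` shares at least one node with `subgraph`."""
    nodes = set(chain.from_iterable(subgraph))
    return edge[0] in nodes or edge[1] in nodes


def mine_frequent_subgraphs(docs, min_support):
    """Level-wise mining: extend each frequent k-edge subgraph by one
    frequent, adjacent edge and keep the extensions that stay frequent."""
    all_edges = set().union(*docs.values())
    frequent_edges = {
        e for e in all_edges if support(frozenset([e]), docs) >= min_support
    }
    level = {frozenset([e]) for e in frequent_edges}
    frequent = set(level)
    while level:
        # Candidates are limited to extensions along frequent edges that
        # connect to the current subgraph (illustrative constraint only).
        candidates = {
            sg | {e}
            for sg in level
            for e in frequent_edges
            if e not in sg and touches(e, sg)
        }
        level = {c for c in candidates if support(c, docs) >= min_support}
        frequent |= level
    return frequent


def jaccard(a, b):
    """Jaccard similarity between two sets of frequent subgraphs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy document-graphs: edges are (child, parent) pairs.
docs = {
    "d1": frozenset({("dog", "canine"), ("canine", "animal"),
                     ("cat", "feline"), ("feline", "animal")}),
    "d2": frozenset({("dog", "canine"), ("canine", "animal")}),
    "d3": frozenset({("car", "vehicle"), ("vehicle", "artifact")}),
    "d4": frozenset({("car", "vehicle"), ("vehicle", "artifact"),
                     ("truck", "vehicle")}),
}

frequent = mine_frequent_subgraphs(docs, min_support=2)

# Describe each document by the frequent subgraphs it contains.
features = {
    name: {sg for sg in frequent if sg <= edges}
    for name, edges in docs.items()
}

print(len(frequent))
print(jaccard(features["d1"], features["d2"]))
print(jaccard(features["d1"], features["d3"]))
```

In this sketch, documents that share frequent subgraphs receive a high similarity score, which any standard clustering method could then consume. The choice of Jaccard similarity is an assumption made for brevity, not a measure taken from the paper.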