Representing document as dependency graph for document clustering

Authors:
Yujing Wang;Xiaochuan Ni;Jian-Tao Sun;Yunhai Tong;Zheng Chen
Affiliations:
Peking University, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Peking University, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

In traditional clustering methods, a document is often represented as a "bag of words" (in the BOW model) or as n-grams (in the suffix tree document model), without considering the natural language relationships between the words. In this paper, we propose a novel approach, DGDC (Dependency Graph-based Document Clustering), to address this issue. In our algorithm, each document is represented as a dependency graph in which the nodes correspond to words, which can be seen as meta-descriptions of the document, and the edges stand for the relations between pairs of words. A new similarity measure is proposed to compute the pairwise similarity of documents based on their corresponding dependency graphs. Applying this similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm yields the final clusters of documents. Experiments were carried out on five public document datasets. The empirical results indicate that the DGDC algorithm achieves better performance in document clustering tasks than other approaches based on the BOW model and the suffix tree document model.
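
The pipeline described in the abstract (dependency graph per document, pairwise graph similarity, group-average agglomerative clustering) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the paper's method: the abstract does not define the DGDC similarity measure, so `graph_similarity` here is a stand-in that mixes Jaccard overlap of nodes and of edges with a hypothetical weight `alpha`; the toy `DOCS` edges are written by hand rather than produced by a dependency parser; and cluster merging uses the average of cross-cluster pairwise similarities.

```python
from itertools import combinations

# Hypothetical toy input: each document is a list of (head, dependent) word
# pairs, as a dependency parser might produce. No parser is run here.
DOCS = {
    "d1": [("eats", "cat"), ("eats", "fish"), ("cat", "black")],
    "d2": [("eats", "cat"), ("eats", "mouse"), ("cat", "black")],
    "d3": [("rises", "stock"), ("rises", "sharply"), ("stock", "tech")],
    "d4": [("falls", "stock"), ("falls", "sharply"), ("stock", "tech")],
}


def build_graph(edges):
    """Return (node set, edge set) for one document."""
    edge_set = set(edges)
    nodes = {word for edge in edge_set for word in edge}
    return nodes, edge_set


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def graph_similarity(g1, g2, alpha=0.5):
    """Assumed stand-in measure: weighted mix of node and edge overlap."""
    return alpha * jaccard(g1[0], g2[0]) + (1 - alpha) * jaccard(g1[1], g2[1])


def group_average_clustering(graphs, k):
    """Repeatedly merge the two clusters whose cross-cluster document pairs
    have the highest average similarity, until k clusters remain."""
    names = sorted(graphs)
    sim = {
        frozenset((a, b)): graph_similarity(graphs[a], graphs[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            avg = sum(sim[frozenset(p)] for p in pairs) / len(pairs)
            if best is None or avg > best[0]:
                best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


graphs = {name: build_graph(edges) for name, edges in DOCS.items()}
clusters = group_average_clustering(graphs, k=2)
print(clusters)
```

On this toy input the two "cat" documents share both words and relations, so they merge first, and the two "stock" documents merge next. A bag-of-words view would see the same word overlap; the edge term is what lets shared relations, rather than shared words alone, raise the similarity.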