In traditional clustering methods, a document is often represented as a "bag of words" (the BOW model) or as n-grams (the suffix tree document model), without considering the natural language relationships between the words. In this paper, we propose a novel approach, DGDC (Dependency Graph-based Document Clustering), to address this issue. In our algorithm, each document is represented as a dependency graph in which the nodes correspond to words, which can be seen as meta-descriptions of the document, and the edges stand for the relations between pairs of words. A new similarity measure is proposed to compute the pairwise similarity of documents based on their corresponding dependency graphs. Applying this similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm yields the final clusters of documents. Experiments were carried out on five public document datasets. The empirical results indicate that the DGDC algorithm achieves better performance in document clustering tasks than other approaches based on the BOW model and the suffix tree document model.
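The pipeline described above (dependency graphs, pairwise graph similarity, then group-average agglomerative clustering) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each document is reduced to a hand-written set of (head, dependent) edges rather than real parser output, the similarity is a plain Jaccard overlap of edge sets standing in for the paper's own measure (which the abstract does not specify), and group-average is taken as the mean pairwise similarity between the members of two clusters. The names `docs`, `graph_similarity`, `group_average`, and `gahc` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy input: each document is a set of (head, dependent) edges,
# written by hand here in place of real dependency-parser output.
docs = {
    "d1": {("chased", "dog"), ("chased", "cat"), ("dog", "big")},
    "d2": {("chased", "dog"), ("chased", "cat"), ("cat", "small")},
    "d3": {("rose", "stocks"), ("rose", "sharply")},
    "d4": {("rose", "stocks"), ("rose", "slowly")},
}


def graph_similarity(edges_a, edges_b):
    """Jaccard overlap of two edge sets (a stand-in, not the paper's measure)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0


def group_average(cluster_a, cluster_b, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[frozenset((a, b))] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def gahc(documents, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain."""
    # Precompute similarities for every unordered pair of documents.
    sim = {
        frozenset((a, b)): graph_similarity(documents[a], documents[b])
        for a, b in combinations(documents, 2)
    }
    # Start with one singleton cluster per document.
    clusters = [(name,) for name in documents]
    while len(clusters) > num_clusters:
        # Pick the pair of clusters with the highest group-average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], sim),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    print(gahc(docs, 2))
```

On this toy input the two documents that share dependency edges about the dog and the cat end up in one cluster and the two about stocks in the other. A bag-of-words representation would see the same words regardless of which word governs which, whereas the edge sets only match when the word pairs stand in the same relation.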