The heavy frequency vector-based text clustering

Authors:
Jun-Peng Bao;Jun-Yi Shen;Xiao-Dong Liu;Hai-Yan Liu
Affiliations:
Department of Computer Science and Engineering, Xi'an Jiaotong University, China;Department of Computer Science and Engineering, Xi'an Jiaotong University, China
Venue:
International Journal of Business Intelligence and Data Mining
Year:
2005

Abstract

The vector space model (VSM) with TF-IDF weighting is a popular way to represent a document. However, it is poorly suited to clustering a dynamic or changing corpus, because the TF-IDF value of every dimension of every VSM vector must be updated whenever a new document is added to the corpus. Furthermore, popular feature selection methods, such as document frequency (DF), information gain (IG) and chi-square, need global corpus information before clustering. We present the heavy frequency vector (HFV), which considers only the most frequent words in a document. Since an HFV does not contain any global corpus information, incremental clustering is easy to implement, especially in a dynamic or changing corpus. We compare the HFV-based K-means model with the traditional VSM-based K-means model under different feature selection methods. The results show that the HFV model has better precision than the others. However, the complexity of the HFV model is greater than that of the others.
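
The contrast between the two representations can be illustrated with a minimal sketch. The abstract does not give the exact HFV definition, so the code below assumes the simplest reading: an HFV keeps the raw counts of the k most frequent words of a single document. The function names, the tokenizer, the parameter `k`, and the plain `tf * log(N / df)` weighting are all illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letter characters (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def heavy_frequency_vector(text, k=3):
    """Keep only the k most frequent words of one document.

    Assumed reading of an HFV: it is computed from the document alone.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def tfidf_vectors(docs):
    """Plain TF-IDF (tf * log(N / df)); every weight depends on the whole corpus."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for counts in tokenized for word in counts)
    return [
        {w: tf * math.log(n / df[w]) for w, tf in counts.items()}
        for counts in tokenized
    ]


# Toy corpus (made up for illustration).
corpus = ["apple apple banana", "banana banana cherry"]
print("HFV before:   ", heavy_frequency_vector(corpus[0], k=2))
print("TF-IDF before:", tfidf_vectors(corpus)[0])

# Adding a document changes N and df, so the first document's TF-IDF
# weights must be recomputed, while its HFV is untouched.
corpus.append("apple cherry cherry")
print("HFV after:    ", heavy_frequency_vector(corpus[0], k=2))
print("TF-IDF after: ", tfidf_vectors(corpus)[0])
```

In this toy run the HFV of the first document is the same before and after the third document arrives, while its TF-IDF weights shift because both the corpus size and the document frequencies change. This is the property that makes a corpus-independent vector convenient for incremental clustering.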