The vector space model (VSM) with TF-IDF weighting is a popular way to represent a document. It is poorly suited to clustering a dynamic or changing corpus, however, because adding a new file to the corpus requires updating the TF-IDF value of every dimension of every VSM vector. Furthermore, popular feature selection methods, such as document frequency (DF), information gain (IG), and chi-square, need global corpus information before clustering can begin. We present the heavy frequency vector (HFV), which considers only the most frequent words in a document. Because an HFV contains no global corpus information, it makes incremental clustering easy to implement, especially in a dynamic or changing corpus. We compare the HFV-based K-means model with the traditional VSM-based K-means model under different feature selection methods. The results show that the HFV model achieves better precision than the others. However, the HFV model has higher computational complexity than the others.
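The contrast between the two representations can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tokenizer, the cutoff `k`, the use of raw counts as weights, and the function names `heavy_frequency_vector`, `cosine`, and `idf` are all assumptions made for illustration, and no stop-word filtering is applied. The sketch builds a per-document vector from the most frequent words only, and shows that an IDF weight shifts when the corpus grows while the per-document vector does not.

```python
import math
import re
from collections import Counter


def heavy_frequency_vector(text, k=5):
    """Return the k most frequent tokens of one document with their counts.

    Assumption: raw counts are used as weights and k is a free parameter.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def idf(term, corpus_term_sets):
    """Plain log(N / df) inverse document frequency; needs the whole corpus."""
    df = sum(1 for terms in corpus_term_sets if term in terms)
    return math.log(len(corpus_term_sets) / df) if df else 0.0


doc = "apple apple banana apple banana cherry"
hfv = heavy_frequency_vector(doc, k=2)

# IDF depends on the corpus: adding one document changes the weight.
corpus = [{"apple", "banana"}, {"apple"}]
idf_before = idf("banana", corpus)
corpus.append({"banana"})
idf_after = idf("banana", corpus)

# The HFV of an existing document is computed from that document alone,
# so it is identical before and after the corpus grows.
hfv_again = heavy_frequency_vector(doc, k=2)
```

In this sketch a new document can be compared against existing vectors or cluster centroids with `cosine` without recomputing anything for the documents already seen, which is the property the abstract attributes to the HFV.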