Text Clustering with Feature Selection by Using Statistical Data

Authors:
Yanjun Li;Congnan Luo;Soon M. Chung
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2008

Abstract

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the Chi-square statistic and new statistical data that measure the positive term-category dependency. We also propose a new text clustering algorithm, TCFS (Text Clustering with Feature Selection). TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, so that the clustering becomes a learning process. We compared TCFS and the k-means clustering algorithm in combination with different feature selection methods on various real data sets. Our experimental results show that TCFS with CHIR has better clustering accuracy in terms of the F-measure and the purity.
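
To make the role of positive term-category dependency concrete, the following is a minimal sketch of the standard Chi-square statistic for one term and one category, together with an observed-to-expected ratio that indicates the direction of the dependency. It is not the paper's CHIR definition, which the abstract does not spell out. The function names `chi_square` and `dependency_ratio`, the 2x2 count layout, and the example counts are assumptions made for illustration.

```python
def chi_square(n11, n10, n01, n00):
    """Standard Chi-square statistic for a 2x2 term/category table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A degenerate table carries no dependency information.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def dependency_ratio(n11, n10, n01, n00):
    """Observed / expected co-occurrence count (illustrative assumption).

    A value above 1 suggests positive dependency (the term appears in the
    category more often than independence predicts); a value below 1
    suggests negative dependency.
    """
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    expected = (n11 + n10) * (n11 + n01) / n
    return n11 / expected if expected else 0.0


# Two hypothetical terms with mirrored counts: the Chi-square values are
# identical, but only the first term is positively tied to the category.
positive_term = (30, 10, 20, 40)
negative_term = (10, 30, 40, 20)

for counts in (positive_term, negative_term):
    print(chi_square(*counts), dependency_ratio(*counts))
```

Because the Chi-square statistic squares the difference between observed and expected counts, it cannot by itself distinguish a term that characterizes a category from one that is conspicuously absent from it. The ratio in this sketch is one simple way to recover that direction.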