High performance genetic algorithm based text clustering using parts of speech and outlier elimination

Authors:
Kansheng Shi; Leming Li
Affiliations:
Shanghai Jiaotong University, Shanghai, China 200240; Chinese Academy of Engineering, Beijing, China 100088
Venue:
Applied Intelligence
Year:
2013

Abstract

Among typical clustering methods, the K-means algorithm plays the most important role because of its simplicity and efficiency. However, it is sensitive to the choice of initial points and tends to fall into local optima. To avoid this flaw, this paper presents a patented text clustering algorithm, Clustering by Genetic Algorithm Model (CGAM). CGAM constructs a genetic algorithm (GA) fitness function and a convergence criterion for the K-means algorithm, because a GA simulates the natural evolutionary process and explores a larger search space. To handle the rich semantics of Chinese texts, CGAM introduces a new method for selecting the initial centers of the GA and accounts for the contribution of features from different parts of speech. The impact of outliers is also addressed. Performance is demonstrated in a series of experiments on both Reuters-21578 and a Chinese text corpus. The experimental results show that CGAM achieves better clustering results than other GA-based K-means algorithms. It has also been applied successfully in a national business intelligence system program that handles very large volumes of content in both Chinese and English.
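The general idea of pairing a GA with K-means can be illustrated with a minimal sketch. This is not the patented CGAM: the abstract does not specify its fitness function, convergence criterion, initial-center selection, part-of-speech weighting, or outlier handling, so none of those are reproduced here. The sketch assumes a generic setup in which each chromosome is a set of k candidate centers, fitness is the within-cluster sum of squared errors, and every child is refined by one K-means step. All function names and parameters (`ga_kmeans`, `pop_size`, `mutation_rate`, and so on) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def sse(points, centers):
    """Within-cluster sum of squared errors; lower means a fitter chromosome."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans_step(points, centers):
    """One K-means update: reassign points, then recompute each centroid."""
    labels = assign(points, centers)
    updated = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))  # empty cluster: keep its center as-is
    return updated

def ga_kmeans(points, k, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Generic GA over sets of k centers (an illustrative assumption, not CGAM)."""
    rng = random.Random(seed)
    # Initial population: each chromosome is k distinct data points used as centers.
    population = [[list(p) for p in rng.sample(points, k)] for _ in range(pop_size)]
    best = min(population, key=lambda c: sse(points, c))
    for _ in range(generations):
        next_pop = [best]  # elitism: the best chromosome always survives
        while len(next_pop) < pop_size:
            # Binary tournament selection of two parents.
            p1 = min(rng.sample(population, 2), key=lambda c: sse(points, c))
            p2 = min(rng.sample(population, 2), key=lambda c: sse(points, c))
            # Uniform crossover: take each center from one parent at random.
            child = [list(rng.choice(pair)) for pair in zip(p1, p2)]
            # Mutation: occasionally reset a center to a random data point.
            for j in range(k):
                if rng.random() < mutation_rate:
                    child[j] = list(rng.choice(points))
            # Local refinement with a single K-means step.
            next_pop.append(kmeans_step(points, child))
        population = next_pop
        best = min(population, key=lambda c: sse(points, c))
    return best, assign(points, best)

# Toy usage: two well-separated groups of 2-D points.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, labels = ga_kmeans(data, k=2)
print(labels)
```

Because the population starts from many different center sets and the best one is always carried forward, a poor initial choice in one chromosome does not lock the search into a single local optimum, which is the weakness of plain K-means that the abstract describes. For text data, the points would be document feature vectors; how CGAM weights features by part of speech is not stated in the abstract and is left out here.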