Using Topic Keyword Clusters for Automatic Document Clustering

Authors:
Hsi-Cheng Chang; Chiun-Chieh Hsu
Affiliations:
Hwa Hsia Institute of Technology; Taiwan University of Science and Technology
Venue:
ICITA '05: Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), Volume 2
Year:
2005

Citing 0
Cited 4

Text Pre-processing for Document Clustering
NLDB '08 Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems

Using Text Segmentation to Enhance the Cluster Hypothesis
AIMSA '08 Proceedings of the 13th international conference on Artificial Intelligence: Methodology, Systems, and Applications

Social trend tracking by time series based social tagging clustering
Expert Systems with Applications: An International Journal

Analysis of single-objective and multi-objective evolutionary algorithms in keyword cluster optimization
EUROCAST'11 Proceedings of the 13th international conference on Computer Aided Systems Theory - Volume Part I

Abstract

Data clustering is a technique for grouping similar data items together so that they are easier to understand. Conventional data clustering methods, including agglomerative hierarchical and partitional clustering algorithms, frequently perform unsatisfactorily on large collections of text articles, and their computational complexity grows very quickly with the number of data items. This paper presents a system for automatic document clustering that identifies topic keyword clusters in the text corpus. The proposed system adopts a multi-stage process. First, an aggressive data cleaning approach is employed to reduce the noise in the free text and to identify the topic keywords within the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor graph approach and the keyword clustering function. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method was assessed against conventional data clustering methods on a web news collection; the results indicate that it is an efficient and effective clustering approach.
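
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the cleaning rules, the similarity measure, or the keyword clustering function, so everything below is an assumption made for illustration. The toy stopword list, the cosine similarity over keyword co-occurrence, the use of connected components of the k-nearest neighbor graph as keyword clusters, and all function names (extract_keywords, keyword_similarity, knn_graph_clusters, assign_documents, cluster_documents) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math
import re

# Toy stopword list (assumption); a real cleaning stage would be far more aggressive.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is"}


def extract_keywords(doc, min_len=3):
    """Stage 1 (assumed): lowercase, tokenize, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) >= min_len}


def keyword_similarity(doc_keywords):
    """Cosine similarity between keywords, based on the documents they occur in."""
    postings = defaultdict(set)
    for i, kws in enumerate(doc_keywords):
        for kw in kws:
            postings[kw].add(i)
    sim = defaultdict(dict)
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            s = shared / math.sqrt(len(postings[a]) * len(postings[b]))
            sim[a][b] = s
            sim[b][a] = s
    return postings, sim


def knn_graph_clusters(postings, sim, k=2):
    """Stage 2 (assumed): build a k-nearest neighbor graph over keywords and
    take its connected components as topic keyword clusters."""
    adj = defaultdict(set)
    for w in postings:
        scores = sim.get(w, {})
        # Ties are broken alphabetically so the result is deterministic.
        nearest = sorted(scores, key=lambda x: (-scores[x], x))[:k]
        for v in nearest:
            adj[w].add(v)
            adj[v].add(w)  # undirected edge: either endpoint may nominate it
    clusters, seen = [], set()
    for w in sorted(postings):
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


def assign_documents(doc_keywords, clusters):
    """Stage 3 (assumed): label each document with the keyword cluster
    that shares the most keywords with it."""
    labels = []
    for kws in doc_keywords:
        overlaps = [len(kws & c) for c in clusters]
        labels.append(max(range(len(clusters)), key=overlaps.__getitem__))
    return labels


def cluster_documents(docs, k=2):
    """Run the three stages end to end; returns (labels, keyword_clusters)."""
    doc_keywords = [extract_keywords(d) for d in docs]
    postings, sim = keyword_similarity(doc_keywords)
    clusters = knn_graph_clusters(postings, sim, k)
    return assign_documents(doc_keywords, clusters), clusters
```

One reason such a design can scale better than clustering documents directly is that the graph is built over the keyword vocabulary rather than over all document pairs, and each document is then labeled in a single pass. The sketch still compares every pair of keywords, so it is suitable only for small examples.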