Using Topic Keyword Clusters for Automatic Document Clustering

Authors:
Hsi-Cheng Chang; Chiun-Chieh Hsu
Affiliations:
The authors are with the Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan. E-mail: cchsu@mail.ntust.edu.tw
Venue:
IEICE - Transactions on Information and Systems
Year:
2005

Citing 0
Cited 1

Concept Extraction and Clustering for Topic Digital Library Construction

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03

Abstract

Data clustering groups similar data items together so that a collection is easier to understand. Conventional data clustering methods, including agglomerative hierarchical clustering and partitional clustering algorithms, frequently perform unsatisfactorily on large text collections because their computational complexity grows very quickly with the number of data items. Poor clustering results degrade intelligent applications such as event tracking and information extraction. This paper presents an unsupervised document clustering method that identifies the topic keyword clusters of a text corpus. The proposed method is a multi-stage process. First, an aggressive data cleaning approach reduces the noise in the free text and identifies the topic keywords in the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor approach and a keyword clustering technique. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method is assessed against conventional data clustering methods on a web news corpus. The experimental results show that the proposed method is an efficient and effective clustering approach.
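
The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the cleaning rules, the similarity measure, or the value of k, so the stopword list, the document-frequency cutoff (`min_df`), Jaccard similarity over the sets of documents containing each keyword, the parameters `k` and `min_sim`, and all function names are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumed stopword list; the paper's actual cleaning rules are not given in the abstract.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "by", "after", "as"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def extract_keywords(docs, min_df=2):
    """Stage 1 (assumed): drop stopwords and keep terms found in at least min_df documents.

    Returns a mapping keyword -> set of indices of the documents that contain it.
    """
    postings = defaultdict(set)
    for i, doc in enumerate(docs):
        for term in tokenize(doc) - STOPWORDS:
            postings[term].add(i)
    return {t: ids for t, ids in postings.items() if len(ids) >= min_df}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def cluster_keywords(postings, k=2, min_sim=0.5):
    """Stage 2 (assumed): link each keyword to its k nearest neighbours, then
    take the connected components of the resulting graph as keyword clusters."""
    words = sorted(postings)
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for w in words:
        neighbours = sorted(
            ((jaccard(postings[w], postings[v]), v) for v in words if v != w),
            reverse=True,
        )
        for sim, v in neighbours[:k]:
            if sim >= min_sim:
                parent[find(w)] = find(v)

    clusters = defaultdict(set)
    for w in words:
        clusters[find(w)].add(w)
    return list(clusters.values())


def assign_documents(docs, keyword_clusters):
    """Stage 3 (assumed): label each document with the keyword cluster it overlaps most.

    A document sharing no keyword with any cluster gets the label None.
    """
    labels = []
    for doc in docs:
        tokens = tokenize(doc)
        overlaps = [len(tokens & cluster) for cluster in keyword_clusters]
        if overlaps and max(overlaps) > 0:
            labels.append(overlaps.index(max(overlaps)))
        else:
            labels.append(None)
    return labels


# Toy corpus, invented for this example.
docs = [
    "Typhoon brings heavy rain to Taipei",
    "Heavy rain and typhoon warnings in Taipei",
    "Stock market falls as investors sell shares",
    "Investors sell shares after stock market falls",
]
postings = extract_keywords(docs)
clusters = cluster_keywords(postings)
labels = assign_documents(docs, clusters)
print(clusters)
print(labels)
```

On this toy corpus the two weather stories receive one label and the two finance stories another. The keyword-linking step still compares every pair of keywords, so the sketch shows the structure of the pipeline and makes no claim about the efficiency reported in the paper.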