Improved ROCK for text clustering using asymmetric proximity

Authors:
Shaoxu Song; Chunping Li
Affiliations:
School of Software, Tsinghua University, Beijing, China; School of Software, Tsinghua University, Beijing, China
Venue:
SOFSEM'06 Proceedings of the 32nd conference on Current Trends in Theory and Practice of Computer Science
Year:
2006

Abstract

The ROCK algorithm can be applied to text clustering in large databases. Its effectiveness, however, is limited by the high dimensionality of textual data and by the traditional proximity measure used for documents. In this paper, we propose an improved approach that uses asymmetric proximity to strengthen the discriminative features of text documents. Instead of the link count used in ROCK, we propose a novel concept, link weight overlap, to measure the proximity between two clusters. The IROCK (Improved ROCK) algorithm performs clustering analysis based on the overlap information of asymmetric proximities between text objects, and carries out the clustering process in an agglomerative hierarchical manner. To demonstrate the effectiveness of IROCK, we perform an experimental evaluation on real textual data. A comparison with ROCK and classical algorithms indicates the superiority of our approach.
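
To make the contrast in the abstract concrete, the following minimal Python sketch shows the standard ROCK link count (the number of neighbors two documents share, where neighbors are documents whose similarity reaches a threshold) next to one possible asymmetric proximity. The abstract does not define asymmetric proximity or link weight overlap, so `asymmetric_proximity` (the fraction of one document's terms that also occur in the other) is a hypothetical stand-in and not necessarily the measure the authors use. The function names, the toy documents, the threshold, and the choice to exclude a document from its own neighbor set are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Symmetric similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def asymmetric_proximity(a, b):
    """Hypothetical asymmetric measure (an assumption, not taken from the paper):
    the fraction of a's terms that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


def neighbors(docs, theta, sim=jaccard):
    """For each document, the indices of other documents with sim >= theta.
    Simplification: a document is not counted as its own neighbor."""
    n = len(docs)
    return [
        {j for j in range(n) if j != i and sim(docs[i], docs[j]) >= theta}
        for i in range(n)
    ]


def link_counts(nbrs):
    """ROCK-style link count: number of common neighbors for each pair."""
    return {
        (i, j): len(nbrs[i] & nbrs[j])
        for i, j in combinations(range(len(nbrs)), 2)
    }


# Toy documents as sets of terms (illustrative data only).
docs = [
    {"stock", "market", "trade"},
    {"stock", "market", "price"},
    {"stock", "trade", "price"},
    {"football", "match", "goal"},
]

nbrs = neighbors(docs, theta=0.5)
links = link_counts(nbrs)

# The asymmetric measure gives different values in the two directions
# when one document's terms are a subset of the other's.
short_doc = {"stock", "market"}
long_doc = {"stock", "market", "trade", "price"}
forward = asymmetric_proximity(short_doc, long_doc)
backward = asymmetric_proximity(long_doc, short_doc)
```

In this toy data the three finance documents each share one neighbor with each other, while the sports document has no links at all, which is the kind of signal an agglomerative procedure can use when deciding which clusters to merge. An asymmetric measure can be passed in through the `sim` parameter of `neighbors`, in which case the neighbor relation itself becomes directional.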