Automatic text processing: the transformation, analysis, and retrieval of information by computer
Automatic text processing: the transformation, analysis, and retrieval of information by computer
ROCK: a robust clustering algorithm for categorical attributes
Information Systems
Data mining: concepts and techniques
Data mining: concepts and techniques
RCV1: A New Benchmark Collection for Text Categorization Research
The Journal of Machine Learning Research
Selforganizing classification on the Reuters news corpus
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Self organization of a massive document collection
IEEE Transactions on Neural Networks
Hi-index | 0.00 |
The ROCK algorithm can be applied to text clustering in large databases. The effectiveness of ROCK, however, is limited, because of the high dimensionality of textual data and traditional proximity measure of documents. In this paper, we propose an improved approach to strengthen the discriminative feature of text documents, which uses asymmetric proximity. Instead of the links count in ROCK, we propose a novel concept of link weight overlaps to measure the proximity between two clusters. The IROCK (Improved ROCK) algorithm performs clustering analysis based on the overlap information of asymmetric proximities between text objects. We carry on the clustering process in an agglomerative hierarchical way. To demonstrate the effectiveness of IROCK, we perform an experimental evaluation on real textual data. A comparison with ROCK and classical algorithms indicates the superiority of our approach.