Pairwise-adaptive dissimilarity measure for document clustering

Authors:
Joris D'hondt;Joris Vertommen;Paul-Armand Verhaegen;Dirk Cattrysse;Joost R. Duflou
Affiliations:
Centre for Industrial Management, Katholieke Universiteit Leuven, Celestijnenlaan 300A bus 2422, 3001 Heverlee, Belgium (all authors)
Venue:
Information Sciences: an International Journal
Year:
2010

Abstract

This paper introduces a novel pairwise-adaptive dissimilarity measure for large, high-dimensional document datasets that improves both the quality and the speed of unsupervised clustering compared to the cosine dissimilarity measure. For each pair of document vectors being compared, the measure dynamically selects a number of important features. Two approaches for choosing the number of selected features are discussed. This feature selection step makes the measure especially suitable for large, high-dimensional document collections. Its performance is validated on several test sets drawn from standardized datasets, and it is compared to the well-known cosine dissimilarity measure using the average F-measure of the hierarchical agglomerative clustering result. The new measure yields better clustering results in less computation time.
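
The abstract does not give the formula, so the following is only a minimal sketch of how a pairwise-adaptive measure of this kind could work, not the authors' definition. It assumes that the "important" features of a document are its k highest-weighted terms, that the features kept for a pair are the union of both documents' top-k sets, and that the dissimilarity is one minus the cosine similarity on that reduced set. The function names, the parameter `k`, and the handling of all-zero vectors are hypothetical choices made for illustration.

```python
import math


def top_k_indices(vec, k):
    """Return the indices of the k largest weights in a dense vector."""
    order = sorted(range(len(vec)), key=lambda i: vec[i], reverse=True)
    return set(order[:k])


def cosine_dissimilarity(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a vector with no weight is maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_adaptive_dissimilarity(a, b, k):
    """Cosine dissimilarity restricted to features selected for this pair.

    Assumption: the selected features are the union of each document's
    k highest-weighted terms; all other dimensions are ignored.
    """
    selected = sorted(top_k_indices(a, k) | top_k_indices(b, k))
    reduced_a = [a[i] for i in selected]
    reduced_b = [b[i] for i in selected]
    return cosine_dissimilarity(reduced_a, reduced_b)


# Two hypothetical term-weight vectors that agree on their dominant term.
doc_a = [3.0, 0.0, 0.0, 1.0]
doc_b = [3.0, 0.0, 0.0, 0.0]
full = cosine_dissimilarity(doc_a, doc_b)
adaptive = pairwise_adaptive_dissimilarity(doc_a, doc_b, k=1)
```

With `k=1` only the shared dominant term is kept, so the low-weight term that separates the two documents under the full cosine measure no longer contributes. Because each comparison touches at most 2k dimensions under these assumptions, the cost per pair no longer grows with the vocabulary size, which is consistent with the speed gain the abstract reports.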