This paper presents a new cluster analysis approach for data samples that carry multiple labels. It specifically addresses the case where no label has an antagonistic label and where the absence of a label on a data item does not necessarily imply that the item cannot have that label, e.g. substances in mineral exploration or keywords of Web pages. The proposed approach relies on two analyses conducted in parallel: cluster analysis and label analysis. The cluster analysis aims at selecting the most interesting or relevant clusters. The label analysis aims at classifying the labels both into specific categories (implicit, explicit, noisy and novel) and into more general, embedding categories (relevant and irrelevant). Both analyses rely on two main sources of information: the similarity between the data items given by the clustering algorithm, and the distribution of the labels in the model after these labels have been projected onto the classification model. Moreover, these methods use original quality measures to perform both the label and the cluster analyses. An experiment on documentary data highlights the accuracy of the proposed approach.
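As a rough illustration of what projecting labels onto a clustering result can look like, the sketch below counts, for each cluster, how the labels of its members are distributed. It is not the paper's method: the abstract does not define its quality measures, so the two ratios computed here (the share of a label's occurrences that fall in a cluster, and the share of a cluster's label occurrences taken by that label) are assumptions chosen for illustration, as are the function name `project_labels` and the toy mineral labels. Only labels that are present are counted, in line with the premise that a missing label is not negative evidence.

```python
from collections import Counter, defaultdict


def project_labels(assignments, labels):
    """Project multi-label annotations onto clusters (illustrative only).

    assignments: list of cluster ids, one per data item.
    labels: list of sets of labels, one set per data item.
    Returns {cluster: {label: (label_share, cluster_share)}} where
      label_share   = occurrences of label in cluster / occurrences of label overall
      cluster_share = occurrences of label in cluster / all label occurrences in cluster
    Both ratios are assumed stand-ins, not the paper's quality measures.
    """
    per_cluster = defaultdict(Counter)
    totals = Counter()
    for cluster, item_labels in zip(assignments, labels):
        for label in item_labels:
            # Absent labels are simply not counted: absence is not evidence.
            per_cluster[cluster][label] += 1
            totals[label] += 1

    profile = {}
    for cluster, counts in per_cluster.items():
        cluster_total = sum(counts.values())
        profile[cluster] = {
            label: (n / totals[label], n / cluster_total)
            for label, n in counts.items()
        }
    return profile


# Hypothetical toy data: five samples, two clusters, three labels.
assignments = [0, 0, 0, 1, 1]
labels = [
    {"gold", "quartz"},
    {"gold"},
    {"gold", "pyrite"},
    {"quartz"},
    {"quartz", "pyrite"},
]
profile = project_labels(assignments, labels)
for cluster in sorted(profile):
    print(cluster, profile[cluster])
```

In this toy data every occurrence of "gold" falls in cluster 0, so a label analysis could treat it as strongly tied to that cluster, whereas "pyrite" is split evenly across both clusters and carries little cluster-specific information.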