Clustering quality measures for data samples with multiple labels

Authors:
Mohammed Attik;Shadi Al Shehabi;Jean-Charles Lamirel
Affiliations:
LORIA, Vandœuvre-lès-Nancy, France;LORIA, Vandœuvre-lès-Nancy, France;LORIA, Vandœuvre-lès-Nancy, France
Venue:
DBA'06 Proceedings of the 24th IASTED international conference on Database and applications
Year:
2006

Abstract

This paper addresses the classification of data associated with multiple labels. It deals in particular with the case where no label has an antagonistic label and where the absence of a label on a data item does not necessarily imply that the item cannot carry that label, e.g. substances in mineral exploration or keywords of Web pages. We propose new clustering quality measures adapted to data associated with multiple labels. These measures rely on two main sources of information: the similarity between data items given by the clustering algorithm, and the distribution of the labels in the model after these labels have been projected onto the classification model. Their main area of application is clustering model selection. They can also be used to determine the stopping criterion for training the clustering algorithm. Experiments with the proposed measures in the field of documentary data analysis show that they significantly outperform the state of the art.
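
The abstract does not give the formulas of the proposed measures, so the following is only an illustrative sketch of the general idea of scoring a clustering from the distribution of labels over its clusters. It assumes a simple label-based recall and precision: each label is attributed to the cluster where it occurs most often, and the scores are averaged over labels. The function name `label_quality` and this particular definition are assumptions made for illustration, not the authors' measures.

```python
from collections import Counter, defaultdict


def label_quality(assignments, labels):
    """Illustrative label-based (recall, precision) for a clustering.

    assignments: list of cluster ids, one per data item.
    labels: list of sets of labels, one set per data item.
    NOTE: hypothetical definition, not the measures defined in the paper.
    """
    cluster_sizes = Counter(assignments)
    label_totals = Counter()
    per_cluster = defaultdict(Counter)  # cluster id -> label -> count

    # Project the labels of each item onto the cluster that holds the item.
    for cluster, item_labels in zip(assignments, labels):
        for label in item_labels:
            label_totals[label] += 1
            per_cluster[cluster][label] += 1

    if not label_totals:
        return 0.0, 0.0

    recalls, precisions = [], []
    for label, total in label_totals.items():
        # Cluster where the label is most frequent (ties: smallest cluster id).
        best = max(sorted(per_cluster), key=lambda c: per_cluster[c][label])
        hits = per_cluster[best][label]
        recalls.append(hits / total)                  # share of the label captured
        precisions.append(hits / cluster_sizes[best])  # share of the cluster carrying it

    n = len(label_totals)
    return sum(recalls) / n, sum(precisions) / n


# Two candidate partitions of the same four items.
item_labels = [{"x"}, {"x"}, {"y"}, {"y"}]
coherent = label_quality([0, 0, 1, 1], item_labels)
mixed = label_quality([0, 1, 0, 1], item_labels)
```

Under these assumptions, a model selection step would keep the candidate clustering with the higher scores. Here `coherent` scores higher than `mixed`, since each label is concentrated in a single cluster.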