Performance evaluation of density-based clustering methods

  • Authors:
  • Ramiz M. Aliguliyev

  • Affiliations:
  • Institute of Information Technology of Azerbaijan National Academy of Sciences, Department of Artificial Intelligence and Computer Sciences, 9, F. Agayev Street, Baku AZ1141, Azerbaijan

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2009

Quantified Score

Hi-index 0.08

Visualization

Abstract

With the development of the World Wide Web, document clustering is receiving more and more attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A good document clustering approach can assist computers in organizing the document corpus automatically into a meaningful cluster hierarchy for efficient browsing and navigation, which is very valuable for complementing the deficiencies of traditional information retrieval technologies. In this paper, we study the performance of different density-based criterion functions, which can be classified as internal, external or hybrid, in the context of partitional clustering of document datasets. In our study, a weight was assigned to each document, which defined its relative position in the entire collection. To show the efficiency of the proposed approach, the weighted methods were compared to their unweighted variants. To verify the robustness of the proposed approach, experiments were conducted on datasets with a wide variety of numbers of clusters, documents and terms. To evaluate the criterion functions, we used the WebKb, Reuters-21578, 20Newsgroups-18828, WebACE and TREC-5 datasets, as they are currently the most widely used benchmarks in document clustering research. To evaluate the quality of a clustering solution, a wide spectrum of indices, three internal validity indices and seven external validity indices, were used. The internal validity indices were used for evaluating the within-cluster scatter and between cluster separations. The external validity indices were used for comparing the clustering solutions produced by the proposed criterion functions with the ''ground truth'' results. Experiments showed that our approach significantly improves clustering quality. In this paper, we developed a modified differential evolution (DE) algorithm to optimize the criterion functions. This modification accelerates the convergence of DE and, unlike the basic DE algorithm, guarantees that the received solution will be feasible.