Evaluating the Performance of Similarity Measures Used in Document Clustering and Information Retrieval

  • Authors:
  • R. Subhashini; V. Jawahar Senthil Kumar

  • Venue:
  • ICIIC '10 Proceedings of the 2010 First International Conference on Integrated Intelligent Computing
  • Year:
  • 2010

Abstract

This paper presents the results of an experimental study of several similarity measures used for both Information Retrieval and Document Clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures, and that it is particularly well suited to text documents. Previous comparisons of these measures used conventional text datasets, whereas the proposed system collects its datasets through an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of the measures on these web document datasets.
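
For readers unfamiliar with the three measures named in the abstract, the following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes raw term-frequency vectors built by whitespace tokenization and the set-based form of the Jaccard measure; the paper may use different term weighting (for example tf-idf) or a different Jaccard variant. The helper names `term_vector`, `cosine_similarity`, `jaccard_similarity`, and `euclidean_distance` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (assumption: whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Set-based Jaccard: shared distinct terms divided by all distinct terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def euclidean_distance(a, b):
    """Straight-line distance between vectors (a distance: 0.0 = identical)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


if __name__ == "__main__":
    d1 = term_vector("web document clustering")
    d2 = term_vector("web document retrieval")
    d3 = term_vector("web document clustering web document clustering")

    print(cosine_similarity(d1, d2))   # two of three terms shared
    print(jaccard_similarity(d1, d2))  # two shared terms out of four distinct terms
    print(euclidean_distance(d1, d2))  # the vectors differ in two coordinates
    print(cosine_similarity(d1, d3), euclidean_distance(d1, d3))
```

In this sketch, `d3` repeats the text of `d1` twice: the cosine measure still treats the two as pointing in the same direction, while the Euclidean distance between them is non-zero. This insensitivity to document length is one commonly cited reason the cosine measure suits text, though the abstract itself does not state the cause of its results.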