Evaluating the Performance of Similarity Measures Used in Document Clustering and Information Retrieval

Authors:
R. Subhashini; V. Jawahar Senthil Kumar
Affiliations:
-;-
Venue:
ICIIC '10 Proceedings of the 2010 First International Conference on Integrated Intelligent Computing
Year:
2010

Abstract

This paper presents the results of an experimental study of several similarity measures used in both Information Retrieval and Document Clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures. The cosine similarity measure performs particularly well on text documents. Previously, these measures have been compared on conventional text datasets; the proposed system instead collects its datasets through an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures on these web document datasets.
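
For readers unfamiliar with the three measures named in the abstract, the following is a minimal sketch of how they are commonly computed on bag-of-words term counts. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the set-based form of Jaccard, and all function names (term_counts, cosine_similarity, jaccard_similarity, euclidean_distance) are assumptions made for illustration. The paper may use a different weighting or the extended (weighted) Jaccard coefficient.

```python
import math
from collections import Counter


def term_counts(text):
    """Assumed preprocessing: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Set-based Jaccard: shared distinct terms over all distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between term-count vectors (smaller means more similar)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


# Two toy documents, invented for this example.
d1 = term_counts("web document clustering")
d2 = term_counts("web document retrieval")

print(cosine_similarity(d1, d2))
print(jaccard_similarity(d1, d2))
print(euclidean_distance(d1, d2))
```

Note that cosine and Jaccard are similarities (higher means closer), whereas the Euclidean measure is a distance (lower means closer). Cosine also ignores document length, which is one commonly cited reason it tends to suit text data.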