A New Similarity Metric for Sequential Data

  • Authors: Pradeep Kumar; Bapi S. Raju; P. Radha Krishna
  • Affiliations: Indian Institute of Management, India; University of Hyderabad, India; Infosys Technologies Limited, Hyderabad, India
  • Venue: International Journal of Data Warehousing and Mining
  • Year: 2010

Abstract

In many data mining applications, both classification and clustering algorithms require a distance or similarity measure. For sequential data, the central problem in similarity-based clustering and classification is choosing an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not explicitly exploit the sequential nature of the data. In this paper, the authors propose a similarity-preserving function called the Sequence and Set Similarity Measure (S3M), which captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the proposed measure on classification and clustering tasks. Experiments were conducted on two benchmark datasets: DARPA'98 for a classification task in intrusion detection, and msnbc for a clustering task in web mining. Results show the usefulness of the proposed measure for both tasks.
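
The abstract does not give the formula for S3M, so the following is only an illustrative sketch of the idea it describes: combining an order-sensitive term with a set-based term. It assumes, hypothetically, that the order term is the length of the longest common subsequence normalized by the longer sequence, that the set term is Jaccard similarity, and that the two are mixed by a weight `p`. The names `lcs_length`, `jaccard`, `s3m_sketch`, and `p` are illustrative and are not taken from the paper.

```python
from typing import Hashable, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Set-based similarity: ignores order and repetition of items."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def s3m_sketch(a: Sequence[Hashable], b: Sequence[Hashable], p: float = 0.5) -> float:
    """Assumed form: p * (order term) + (1 - p) * (set term).

    This weighting scheme is an assumption made for illustration,
    not the definition confirmed by the abstract.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not a and not b:
        return 1.0
    order_term = lcs_length(a, b) / max(len(a), len(b))
    return p * order_term + (1.0 - p) * jaccard(a, b)


# Two hypothetical page-visit sequences with the same items in opposite order.
forward = ["login", "search", "cart", "checkout"]
backward = list(reversed(forward))

set_only = jaccard(forward, backward)       # cannot tell the two apart
combined = s3m_sketch(forward, backward)    # penalizes the reversed order
```

Under these assumptions, a purely set-based metric treats the two sequences as identical, while the combined score drops because only one item can be matched in order. Setting `p` closer to 1 emphasizes ordering, and closer to 0 emphasizes shared content.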