An Empirical Evaluation of Similarity Coefficients for Binary Valued Data

Authors:
David M. Lewis; Vandana P. Janeja
Affiliations:
Carnegie Mellon University, USA; University of Maryland, Baltimore County, USA
Venue:
International Journal of Data Warehousing and Mining
Year:
2011

Cites (8):
Neighborhood based detection of anomalies in high dimensional spatio-temporal sensor datasets. Proceedings of the 2004 ACM symposium on Applied computing.
A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering.
Bounds of Resemblance Measures for Binary (Presence/Absence) Variables. Journal of Classification.
Searching for relevant software change artifacts using semantic networks. Proceedings of the 2009 ACM symposium on Applied Computing.
Similarity coefficient methods applied to the cell formation problem: a comparative investigation. Computers and Industrial Engineering - Special issue: Group technology/cellular manufacturing.
Using Semantic Networks and Context in Search for Relevant Software Engineering Artifacts. Journal on Data Semantics XIV.
Spatial neighborhood based anomaly detection in sensor datasets. Data Mining and Knowledge Discovery.
User-Centric Similarity and Proximity Measures for Spatial Personalization. International Journal of Data Warehousing and Mining.

Cited by (2):
Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System. International Journal of Data Warehousing and Mining.
Context and semantics for detection of cyber attacks. International Journal of Information and Computer Security.

Abstract

In this paper, the authors present an empirical evaluation of similarity coefficients for binary-valued data. Similarity coefficients provide a means to measure the similarity or distance between two binary-valued objects in a dataset, where each attribute describing an object takes a value of 0 or 1. This is useful in several domains, such as comparing feature vectors in sensor networks, document search, router network mining, and web mining. The authors survey 35 similarity coefficients used in various domains and present conclusions about the efficacy of the computed similarity under three conditions: (1) labeled data, to quantify the accuracy of the similarity coefficients; (2) varying density of the data, to evaluate the effect of sparsity of the values; and (3) varying numbers of attributes, to see the effect of high dimensionality on the computed similarity.
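
To make the notion of a similarity coefficient on 0-1 attributes concrete, the following is a minimal Python sketch of two widely used coefficients, the Jaccard coefficient and the simple matching coefficient. The abstract does not list the 35 coefficients surveyed, so the choice of these two is an assumption made for illustration, and the function names (`contingency_counts`, `jaccard`, `simple_matching`) and the handling of empty denominators are hypothetical rather than taken from the paper.

```python
from typing import Sequence, Tuple


def contingency_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count attribute agreements and disagreements between two 0-1 vectors.

    Returns (a, b, c, d) where
      a = attributes that are 1 in both objects,
      b = attributes that are 1 in x only,
      c = attributes that are 1 in y only,
      d = attributes that are 0 in both objects.
    """
    if len(x) != len(y):
        raise ValueError("both objects must have the same number of attributes")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            a += 1
        elif xi:
            b += 1
        elif yi:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient a / (a + b + c); shared zeros are ignored."""
    a, b, c, _ = contingency_counts(x, y)
    denom = a + b + c
    # Assumption: two all-zero objects are treated as identical.
    return a / denom if denom else 1.0


def simple_matching(x: Sequence[int], y: Sequence[int]) -> float:
    """Simple matching coefficient (a + d) / (a + b + c + d); shared zeros count."""
    a, b, c, d = contingency_counts(x, y)
    n = a + b + c + d
    # Assumption: two empty objects are treated as identical.
    return (a + d) / n if n else 1.0


if __name__ == "__main__":
    x = [1, 1, 0, 0, 1, 0, 0, 0]
    y = [1, 0, 0, 0, 1, 1, 0, 0]
    print(jaccard(x, y))          # 0.5  (a=2, b=1, c=1)
    print(simple_matching(x, y))  # 0.75 (a=2, d=4, n=8)
```

On this pair of vectors the two coefficients disagree (0.5 versus 0.75) because the simple matching coefficient counts the four shared zeros as agreement while the Jaccard coefficient ignores them. This is one way in which the density of the data can affect the similarity that is computed.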