A diversity measure leveraging domain specific auxiliary information

  • Authors:
  • Narayan Bhamidipati; Nagaraj Kota

  • Affiliations:
  • Yahoo! Labs, Bangalore, India; Yahoo! Labs, Bangalore, India

  • Venue:
  • Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11)
  • Year:
  • 2011

Abstract

This article deals with the notion of reduced uncertainty when the probability mass is distributed over similar rather than dissimilar values. Shannon's entropy is a frequently used information-theoretic measure of the uncertainty associated with a random variable, but it depends solely on the values taken by the probability mass function and does not take into consideration whether the mass is distributed among similar or dissimilar values. A similarity structure on the values assumed by the random variable, possibly obtained through domain knowledge, may reduce the associated uncertainty: the greater the similarity, the lower the uncertainty. A novel measure named Similarity Adjusted Entropy (Sim-adjusted Entropy for short), which generalizes Shannon's entropy, is proposed to capture the effect of this similarity structure. Sim-adjusted entropy provides a mechanism for incorporating domain expertise into an entropy-based framework for solving various data mining tasks. Applications highlighted in this manuscript include clustering of categorical data and measuring audience diversity. Experiments performed on a Yahoo! Answers data set demonstrate the ability of the proposed method to obtain more cohesive clusters, and another set of experiments confirms the utility of the proposed measure for quantifying audience diversity.
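
The abstract does not give the formula for Sim-adjusted Entropy; that is defined in the paper itself. The sketch below therefore illustrates the general idea using a closely related, well-known construction, the similarity-sensitive entropy of Ricotta and Szeidl (also studied by Leinster and Cobbold), in which a similarity matrix S discounts mass that is spread over mutually similar values and the identity matrix recovers Shannon's entropy. The function names and the specific form H(p, S) = -Σ_i p_i log((S p)_i) are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def similarity_adjusted_entropy(p, S):
    """Similarity-sensitive entropy H = -sum_i p_i * log((S p)_i).

    S[i, j] in [0, 1] is the similarity between values i and j,
    with S[i, i] = 1. With S = I this reduces to Shannon's entropy;
    larger off-diagonal similarities shrink the measured uncertainty.
    (Illustrative stand-in, not the paper's exact definition.)
    """
    p = np.asarray(p, dtype=float)
    S = np.asarray(S, dtype=float)
    effective = S @ p          # mass "seen" at each value via similarity
    nz = p > 0
    return -np.sum(p[nz] * np.log(effective[nz]))

if __name__ == "__main__":
    p = [1 / 3, 1 / 3, 1 / 3]  # uniform mass over three values

    # All values mutually dissimilar: recovers Shannon entropy, log(3).
    print(similarity_adjusted_entropy(p, np.eye(3)))   # ~1.0986

    # Values 0 and 1 are highly similar: uncertainty drops below log(3).
    S = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    print(similarity_adjusted_entropy(p, S))           # ~0.6707
```

In the toy example, making the first two values highly similar lowers the measured uncertainty well below log 3 even though the probability mass function is unchanged, matching the abstract's intuition that the greater the similarity, the lower the uncertainty.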