Density analysis of winnowing on non-uniform distributions

  • Authors:
  • Xiaoming Yu;Yue Liu;Hongbo Xu

  • Affiliations:
  • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China and Graduate School, Chinese Academy of Sciences, Beijing, P.R. China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China

  • Venue:
  • APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
  • Year:
  • 2007

Abstract

The growing number of copies of digital documents makes duplicate detection an important problem. Among the techniques proposed so far, the Winnowing fingerprinting algorithm [5] is one of the most efficient. However, the previous density analysis leaves the performance of Winnowing without guarantees in real systems, because its assumption of uniformly distributed k-grams is far from true in practice. In this paper, an improved density analysis method is introduced. Unlike the previous analysis, our method requires only that the k-grams be identically distributed to yield its prediction. This means our theoretical result can be safely applied to highly non-uniformly distributed data, which are common in real systems. Extensive experiments are performed on both artificial and real data. The experimental results agree well with the theoretical predictions.
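For readers unfamiliar with the term, "density" here is the fraction of k-gram hashes that Winnowing keeps as fingerprints. The following is a minimal sketch, not the authors' code: it assumes the standard Winnowing rule (keep the minimum hash in every window of w consecutive hashes, taking the rightmost on ties) and measures density empirically. The function names, the window size, and the Zipf-like weights are hypothetical choices for illustration. The 2/(w+1) reference value comes from the original Winnowing analysis under the uniform assumption, not from this abstract.

```python
import random


def winnow(hashes, w):
    """Return the sorted positions selected by winnowing with window size w.

    Assumed rule: in each window of w consecutive hashes, select the
    minimum value; on ties, select the rightmost occurrence.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in the window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add(start + offset)
    return sorted(selected)


def density(hashes, w):
    """Fraction of hash positions that winnowing selects as fingerprints."""
    return len(winnow(hashes, w)) / len(hashes)


if __name__ == "__main__":
    rng = random.Random(0)
    n, w = 20000, 9  # hypothetical sizes, chosen only for the demo

    # Uniform case: 32-bit hashes drawn uniformly at random.
    uniform = [rng.getrandbits(32) for _ in range(n)]

    # Non-uniform case: independent draws from one fixed, heavily skewed
    # (Zipf-like) distribution over 1000 values. This is an assumption
    # standing in for "identically distributed" data, not the paper's setup.
    values = list(range(1000))
    weights = [1.0 / (i + 1) for i in values]
    skewed = rng.choices(values, weights=weights, k=n)

    print("uniform reference 2/(w+1):", 2 / (w + 1))
    print("empirical density, uniform:", density(uniform, w))
    print("empirical density, skewed: ", density(skewed, w))
```

Running the script prints the empirical densities side by side, which gives a quick way to compare a measured density on one's own data against a theoretical prediction.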