Numerical recipes in C (2nd ed.): the art of scientific computing
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Finding similar files in large document repositories
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference
The growing number of copies of digital documents makes duplicate detection an important problem. Among the techniques proposed so far, the Winnowing fingerprinting algorithm [5] is one of the most efficient. However, the previous density analysis leaves the performance of Winnowing without guarantees in real systems, because its assumption of uniformly distributed k-grams is far from true in practice. In this paper, an improved density analysis method is introduced. Compared with the previous analysis, our method requires only identically distributed k-grams to obtain its prediction. This means our theoretical result can be safely applied to the highly non-uniformly distributed data that are common in real systems. Extensive experiments are performed on both artificial and real data. The experimental results agree well with the theoretical predictions.
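To make the notion of density concrete, the following is a minimal sketch of basic winnowing as described in the Winnowing paper [5]: slide a window of w consecutive k-gram hashes over the sequence and keep the minimum hash of each window, taking the rightmost occurrence on ties. Density is then the fraction of hashes kept as fingerprints. The function names `winnow` and `density` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the improved analysis introduced here.

```python
def winnow(hashes, w):
    """Return the sorted positions of the hashes kept by basic winnowing.

    hashes: list of k-gram hash values (assumed already computed)
    w:      window size, in number of hashes
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, keep the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add(start + offset)
    return sorted(selected)


def density(hashes, w):
    """Fraction of k-gram hashes that winnowing keeps as fingerprints."""
    if not hashes:
        return 0.0
    return len(winnow(hashes, w)) / len(hashes)


# Hash sequence from the worked example in the Winnowing paper, w = 4.
example = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(example, 4))   # -> [3, 6, 8, 11, 15]
print(density(example, 4))  # 5 of 17 hashes are kept
```

For independent, uniformly distributed hashes, the earlier analysis predicts a density of 2/(w+1). A highly repetitive input behaves very differently: a sequence of identical hashes makes this sketch keep the rightmost element of every window, so the density approaches 1. This illustrates why an analysis that does not rely on uniformity matters for real data.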