A fusion of algorithms in near duplicate document detection

  • Authors:
  • Jun Fan;Tiejun Huang

  • Affiliations:
  • National Engineering Laboratory for Video Technology, School of EE & CS, Peking University, Beijing, China;National Engineering Laboratory for Video Technology, School of EE & CS, Peking University, Beijing, China

  • Venue:
  • PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
  • Year:
  • 2011

Abstract

With the rapid growth of the World Wide Web, the Internet now contains a huge number of fully or partially duplicated pages. Returning these near-duplicate results to users degrades the user experience. In deploying digital libraries, the protection of intellectual property and the removal of duplicate content also need to be considered. This paper fuses several state-of-the-art algorithms to achieve better performance. We first introduce the three major algorithms for duplicate document detection (shingling, I-Match, and simhash) and their subsequent developments. We take sequences of words (shingles) as the features of the simhash algorithm. We then incorporate the random-lexicon-based multi-fingerprint generation method into the shingling-based simhash algorithm, and name the result the shingling-based multi-fingerprint simhash algorithm. We ran preliminary experiments on a synthetic dataset based on the "China-US Million Book Digital Library Project". The experimental results demonstrate the efficiency of these algorithms.
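The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea: word shingles are used as simhash features, and several fingerprints are produced per document from random sub-lexicons. All function names, the shingle length `k`, the 64-bit MD5-derived hash, the `keep` ratio, and the choice to filter words by lexicon before shingling are assumptions made for illustration, not the authors' method.

```python
import hashlib
import random


def shingles(text, k=3):
    """Return overlapping k-word shingles (k is an assumed parameter)."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def _hash64(feature):
    """64-bit feature hash taken from MD5 (an arbitrary choice for this sketch)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Unweighted simhash: per-bit vote over all feature hashes."""
    counts = [0] * bits
    for feature in features:
        h = _hash64(feature)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def multi_fingerprints(text, vocabulary, n_lexicons=3, keep=0.7, k=3, seed=0):
    """One simhash per random sub-lexicon (hypothetical combination scheme).

    Each sub-lexicon is a random sample of the vocabulary; words outside it
    are dropped before shingling. This filtering step is an assumption.
    """
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    size = min(len(vocab), max(1, int(keep * len(vocab))))
    fingerprints = []
    for _ in range(n_lexicons):
        lexicon = set(rng.sample(vocab, size))
        kept = " ".join(w for w in text.lower().split() if w in lexicon)
        fingerprints.append(simhash(shingles(kept, k)))
    return fingerprints
```

Under these assumptions, two documents could be flagged as near duplicates when `hamming(simhash(shingles(a)), simhash(shingles(b)))` falls below a chosen threshold, or when any pair of corresponding fingerprints from `multi_fingerprints` does; the threshold itself would have to be tuned on real data.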