Weighted shingling: an adaptation of shingling for weighted shingles

  • Authors:
  • Zahra Eskandari Gharghe;Behrouz Minaei Bidgoli

  • Affiliations:
  • Iran University of Science and Technology;Iran University of Science and Technology

  • Venue:
  • IIT'09 Proceedings of the 6th international conference on Innovations in information technology
  • Year:
  • 2009

Abstract

Broder's shingling is one of the state-of-the-art approaches to detecting near-duplicate documents. Prior evaluations of this method have shown that document pairs with different main content but a large amount of similar unimportant detail are the main source of its errors. Different web pages from the same site are a good example of such documents. Such pages almost always share similar boilerplate text, which may be selected as the document's fingerprint and so trick the algorithm. This problem appears to stem from representing each document only by a sample of its shingles: the sample contains some of the page's shingles and discards all other information. By including additional information in this sample, such as the frequencies of shingles, we can improve the performance of the algorithm. This paper proposes a weighting of shingles and adapts shingling so that it can be applied to weighted shingles. Our results show an improvement in shingling's performance.
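
To make the idea of keeping shingle frequencies concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not specify their weighting scheme, so the sketch assumes raw occurrence counts as weights and uses the generalized (weighted) Jaccard coefficient as a stand-in comparison. The function names `shingles`, `resemblance`, and `weighted_resemblance` are hypothetical, and the sampling (fingerprinting) step of Broder's algorithm is omitted for brevity.

```python
from collections import Counter


def shingles(text, w=2):
    """Count the w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + w]) for i in range(len(words) - w + 1))


def resemblance(a, b):
    """Unweighted Jaccard resemblance over the sets of distinct shingles."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def weighted_resemblance(a, b):
    """Generalized Jaccard: sum of min counts over sum of max counts.

    Assumption: the weight of a shingle is its raw frequency. The paper's
    actual weighting may differ.
    """
    keys = set(a) | set(b)
    den = sum(max(a[k], b[k]) for k in keys)
    num = sum(min(a[k], b[k]) for k in keys)
    return num / den if den else 0.0


doc_a = shingles("a b a b a b")  # counts: ("a","b") x3, ("b","a") x2
doc_b = shingles("a b a")        # counts: ("a","b") x1, ("b","a") x1
print(resemblance(doc_a, doc_b))           # same distinct shingles
print(weighted_resemblance(doc_a, doc_b))  # frequencies differ
```

In this toy case the two documents share exactly the same distinct shingles, so the set-based measure cannot tell them apart, while the frequency-aware measure can. This only illustrates what extra information frequencies carry; whether and how that helps with boilerplate is what the paper evaluates.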