Weighted shingling: an adaptation of shingling for weighted shingles

Authors:
Zahra Eskandari Gharghe;Behrouz Minaei Bidgoli
Affiliations:
Iran University of Science and Technology;Iran University of Science and Technology
Venue:
IIT'09 Proceedings of the 6th international conference on Innovations in information technology
Year:
2009

Citing 5
Cited 0

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997

On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress

Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

Broder's shingling is one of the state-of-the-art approaches to detecting near-duplicate documents. Prior evaluations of this method have shown that its errors come mainly from document pairs that differ in main content but share a large amount of similar, unimportant detail. Different web pages from the same site are a good example of such documents: they almost always share boilerplate text, which may be selected as a document's fingerprint and mislead the algorithm. The problem appears to stem from representing each document only by a sample of its shingles, which contains some of the page's shingles and discards all other information. By including additional information in this sample, such as the frequencies of shingles, we can improve the performance of the algorithm. This paper proposes a weighting of shingles and adapts shingling to operate on weighted shingles. Our results show an improvement in shingling's performance.
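
The abstract does not specify the weighting scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes word-level shingles, a sample made of the shingles with the smallest hash values, raw shingle frequency as the weight, and a generalized (weighted) Jaccard score computed directly on the two samples. All function names and parameters (`shingles`, `weighted_sketch`, `weighted_jaccard`, `k`, `size`) are hypothetical.

```python
import hashlib
from collections import Counter


def shingles(text, k=4):
    """Return the list of k-word shingles of a text (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def stable_hash(shingle):
    """Deterministic 64-bit hash of a shingle, stable across runs."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def weighted_sketch(text, k=4, size=8):
    """Sample the `size` distinct shingles with the smallest hashes,
    keeping each sampled shingle's frequency as its weight (an assumption)."""
    counts = Counter(shingles(text, k))
    sampled = sorted(counts, key=stable_hash)[:size]
    return {s: counts[s] for s in sampled}


def weighted_jaccard(sketch_a, sketch_b):
    """Generalized Jaccard: sum of minimum weights over sum of maximum weights.
    Returns 0.0 when both sketches are empty."""
    keys = set(sketch_a) | set(sketch_b)
    num = sum(min(sketch_a.get(s, 0), sketch_b.get(s, 0)) for s in keys)
    den = sum(max(sketch_a.get(s, 0), sketch_b.get(s, 0)) for s in keys)
    return num / den if den else 0.0
```

With unit weights this score reduces to the ordinary set resemblance of the two samples. Other weightings, for example ones that down-weight shingles shared across a whole site, would plug into the same `weighted_jaccard` function.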