Broder's shingling is one of the state-of-the-art approaches to detecting near-duplicate documents. Prior evaluations of this method have shown that its errors come mainly from document pairs whose main content differs but which share a large amount of similar, unimportant detail. Different web pages from the same site are a good example: they almost always share boilerplate text, which may be selected as the document's fingerprint and mislead the algorithm. The problem appears to stem from representing each document only by a sample of its shingles, which contains some of the page's shingles and discards all other information. By including additional information in this sample, such as shingle frequencies, we can improve the algorithm's performance. This paper proposes a weighting of shingles and adapts shingling to operate on weighted shingles. Our results show an improvement in shingling's performance.
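To make the idea of a frequency-aware sample concrete, the following is a minimal sketch and not the paper's method. The abstract does not specify the weighting scheme, so this sketch assumes that a shingle's weight is simply its in-document frequency and that weighted sketches are compared with a weighted Jaccard ratio (sum of minima over sum of maxima). The function names, the word-level shingle width `w`, the sample size `k`, and the use of SHA-1 as the hash are all illustrative assumptions. Comparing two bottom-k sketches directly is also a simplification of the usual min-wise estimator.

```python
import hashlib
from collections import Counter


def shingles(text, w=3):
    """Count word-level w-shingles (contiguous runs of w words) in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + w]) for i in range(len(words) - w + 1))


def stable_hash(shingle):
    """Deterministic 64-bit hash of a shingle, stable across runs."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(counts, k=8):
    """Sample the k shingles with the smallest hashes, keeping their frequencies.

    Assumption: frequency is the weight; the paper's weighting may differ.
    """
    smallest = sorted(counts, key=stable_hash)[:k]
    return {s: counts[s] for s in smallest}


def resemblance(a, b):
    """Unweighted comparison: Jaccard ratio over the sampled shingle sets."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def weighted_resemblance(a, b):
    """Weighted comparison: sum of min frequencies over sum of max frequencies."""
    keys = set(a) | set(b)
    num = sum(min(a.get(s, 0), b.get(s, 0)) for s in keys)
    den = sum(max(a.get(s, 0), b.get(s, 0)) for s in keys)
    return num / den if den else 0.0
```

Under these assumptions, a shingle that both pages share but that occurs far more often in one than in the other still counts as a full match for `resemblance`, but contributes only its smaller frequency to `weighted_resemblance`. Whether this particular weighting reduces boilerplate-induced errors would need to be checked against the paper's own scheme.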