Near-replicas of web pages detection efficient algorithm based on single MD5 fingerprint

Authors:
Wang Da-Zhen; Chen Yu-Hui
Affiliations:
Department of Computer Science, Hubei University of Technology, Wuhan, P.R.C. and Department of Information Management, Wuhan University, Wuhan, P.R.C.; Department of Computer Science, Hubei University of Technology, Wuhan, P.R.C.
Venue:
ICAI'07 Proceedings of the 8th Conference on 8th WSEAS International Conference on Automation and Information - Volume 8
Year:
2007

Citing 7
Cited 0

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web

Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web

Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases

Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

A preprocessing framework and approach for web applications
Journal of Web Engineering

Abstract

We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archives, and the presentation of search results, among other applications. Our experiments show how common replication is on the web and indicate that our algorithm performs better than the alternatives.