Identifying replicated sites is an important task for search engines: it can reduce data storage costs, shorten query processing time, and remove noise that might degrade the quality of the answers returned to the user. This paper introduces a new approach to detecting websites that are likely to be replicas in a search engine database. Our method uses the structure of the websites and the content of their pages to identify possible replicas. As we show through experiments, this combination improves precision and reduces the overall cost of the replica detection task. Our method achieves a quality improvement of 47.23% when compared to previously proposed approaches.
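To make the idea of combining structure and content concrete, the sketch below compares two sites by the overlap of their URL paths (a stand-in for structure) and by the fraction of shared paths whose page bodies hash identically (a stand-in for content). It is a minimal illustration only and not the method evaluated in the paper: the function names `site_profile`, `replica_scores`, and `likely_replicas`, the exact-hash content comparison, and both thresholds are assumptions introduced here.

```python
from hashlib import sha1
from urllib.parse import urlsplit


def site_profile(pages):
    """Map each URL path of a site to a hash of its page body.

    `pages` is a dict of {url: page_text}; this input shape is an assumption.
    """
    return {
        (urlsplit(url).path or "/"): sha1(body.encode("utf-8")).hexdigest()
        for url, body in pages.items()
    }


def replica_scores(pages_a, pages_b):
    """Return (structure, content) similarity scores in [0, 1].

    structure: Jaccard overlap of the two sites' URL path sets.
    content:   fraction of shared paths whose bodies hash identically.
    """
    a, b = site_profile(pages_a), site_profile(pages_b)
    shared = set(a) & set(b)
    union = set(a) | set(b)
    if not union:
        return 0.0, 0.0
    structure = len(shared) / len(union)
    content = sum(a[p] == b[p] for p in shared) / len(shared) if shared else 0.0
    return structure, content


def likely_replicas(pages_a, pages_b, min_structure=0.5, min_content=0.8):
    """Flag a pair of sites as candidate replicas (thresholds are illustrative)."""
    structure, content = replica_scores(pages_a, pages_b)
    return structure >= min_structure and content >= min_content


site_a = {
    "http://a.example/": "home",
    "http://a.example/docs/x.html": "X",
}
site_b = {
    "http://b.example/": "home",
    "http://b.example/docs/x.html": "X",
    "http://b.example/extra": "E",
}
structure, content = replica_scores(site_a, site_b)
```

Checking path overlap first is cheap and lets a system skip the more expensive content comparison for site pairs that share little structure, which is one plausible way a combined approach could reduce overall cost.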