A cost-effective method for detecting web site replicas on search engine databases

Authors:
André Luiz da Costa Carvalho; Edleno Silva de Moura; Altigran Soares da Silva; Klessius Berlt; Allan Bezerra
Affiliations:
Federal University of Amazonas, Computer Science Department, Av. Rodrigo Octávio, Ramos 3000, Manaus, Brazil (all authors)
Venue:
Data & Knowledge Engineering
Year:
2007

Abstract

Identifying replicated sites is an important task for search engines. It can reduce data storage costs, improve query processing time, and remove noise that might affect the quality of the final answers given to the user. This paper introduces a new approach for detecting web sites that are likely to be replicas in a search engine database. Our method uses both the structure of the web sites and the content of their pages to identify possible replicas. As we show through experiments, this combination improves precision and reduces the overall cost of the replica detection task. Our method achieves a quality improvement of 47.23% when compared to previously proposed approaches.
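
The abstract does not spell out how structure and content are combined, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' algorithm. It assumes each site is available as a mapping from URL path to page text, treats the set of paths as the site's structure, and uses an MD5 hash of whitespace-normalized text as a page fingerprint. The names `jaccard`, `fingerprint`, `replica_score` and the `structure_threshold` parameter are all hypothetical.

```python
import hashlib


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fingerprint(text):
    """Hash of the page text after lowercasing and collapsing whitespace.

    Assumption: exact-match fingerprints stand in for whatever content
    comparison a real system would use.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def replica_score(site_a, site_b, structure_threshold=0.5):
    """Score in [0, 1] for how likely two sites are replicas.

    site_a, site_b: dicts mapping URL path -> page text (assumed layout).
    Structure (path overlap) is checked first because it is cheap; page
    contents are compared only for pairs that pass the structural filter.
    """
    paths_a, paths_b = set(site_a), set(site_b)
    structure = jaccard(paths_a, paths_b)
    if structure < structure_threshold:
        return 0.0  # structurally dissimilar: skip content comparison

    shared = paths_a & paths_b
    if not shared:
        return 0.0
    matching = sum(
        1 for p in shared if fingerprint(site_a[p]) == fingerprint(site_b[p])
    )
    content = matching / len(shared)
    return structure * content


if __name__ == "__main__":
    original = {"/": "Home page", "/about": "About us", "/docs": "Docs"}
    mirror = {"/": "home   page", "/about": "About us", "/docs": "Changed"}
    unrelated = {"/shop": "Buy things"}
    print(replica_score(original, mirror))
    print(replica_score(original, unrelated))
```

Running the cheap structural check before any content comparison is one way such a combination could lower overall cost, since most site pairs would be discarded without reading page text.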