We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent on Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page in response to many distinct URL requests. We present a novel algorithm, DustBuster, for uncovering DUST, that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying the discovered rules via sampling requires fetching only a small number of actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
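To make the rule-mining idea concrete, the following Python sketch shows one simplified way to surface candidate substring-substitution rules (alpha -> beta) from a URL list alone, without fetching pages: two different substrings that occur between the same surrounding prefix and suffix in the log suggest a rule, and the number of distinct such "envelopes" serves as the rule's support. The function name, toy URLs, and support counting are illustrative assumptions, not DustBuster's exact algorithm or parameters.

```python
from collections import defaultdict

def candidate_substitution_rules(urls, max_sub_len=12):
    """Mine candidate DUST rules (alpha -> beta) from URL strings alone.
    Illustrative sketch: names, limits, and scoring are assumptions."""
    # Group every decomposition url = prefix + sub + suffix by its
    # envelope (prefix, suffix); bound |sub| to keep this tractable.
    envelopes = defaultdict(set)
    for url in urls:
        n = len(url)
        for i in range(n + 1):
            for j in range(i, min(i + max_sub_len, n) + 1):
                envelopes[(url[:i], url[j:])].add(url[i:j])
    # Two different substrings under the same envelope suggest a rule;
    # count the number of distinct envelopes as the rule's support.
    support = defaultdict(int)
    for subs in envelopes.values():
        for a in subs:
            for b in subs:
                if a != b:
                    support[(a, b)] += 1
    return support

urls = [
    "http://example.com/story?id=1",
    "http://example.com/story_1",
    "http://example.com/story?id=2",
    "http://example.com/story_2",
]
rules = candidate_substitution_rules(urls)
# Among the top-supported candidates is ('?id=', '_'),
# observed under two distinct envelopes (the "...1" and "...2" pages).
for rule, count in sorted(rules.items(), key=lambda kv: -kv[1])[:3]:
    print(rule, count)
```

Consistent with the abstract, high-support candidates would then be validated by sampling: fetch a few URL pairs related by each rule and keep the rule only if the fetched contents are indeed similar.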