We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent on Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page in response to many distinct URL requests. We present a novel algorithm, DustBuster, for uncovering DUST, that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying the discovered rules via sampling requires fetching only a small number of actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
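To make the rule-mining idea concrete, the following Python sketch shows one simplified way to surface candidate substring-substitution rules (alpha -> beta) from a URL list alone, without fetching pages: two different substrings that occur between the same surrounding prefix and suffix in the log suggest a rule, and the number of distinct such "envelopes" serves as the rule's support. The function name, toy URLs, and support counting are illustrative assumptions, not DustBuster's exact algorithm or parameters.

```python
from collections import defaultdict

def candidate_substitution_rules(urls, max_sub_len=12):
    """Mine candidate DUST rules (alpha -> beta) from URL strings alone.
    Illustrative sketch: names, limits, and scoring are assumptions."""
    # Group every decomposition url = prefix + sub + suffix by its
    # envelope (prefix, suffix); bound |sub| to keep this tractable.
    envelopes = defaultdict(set)
    for url in urls:
        n = len(url)
        for i in range(n + 1):
            for j in range(i, min(i + max_sub_len, n) + 1):
                envelopes[(url[:i], url[j:])].add(url[i:j])
    # Two different substrings under the same envelope suggest a rule;
    # count the number of distinct envelopes as the rule's support.
    support = defaultdict(int)
    for subs in envelopes.values():
        for a in subs:
            for b in subs:
                if a != b:
                    support[(a, b)] += 1
    return support

urls = [
    "http://example.com/story?id=1",
    "http://example.com/story_1",
    "http://example.com/story?id=2",
    "http://example.com/story_2",
]
rules = candidate_substitution_rules(urls)
# Among the top-supported candidates is ('?id=', '_'),
# observed under two distinct envelopes (the "...1" and "...2" pages).
for rule, count in sorted(rules.items(), key=lambda kv: -kv[1])[:3]:
    print(rule, count)
```

Consistent with the abstract, high-support candidates would then be validated by sampling: fetch a few URL pairs related by each rule and keep the rule only if the fetched contents are indeed similar.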