Do not crawl in the DUST: different URLs with similar text

Authors:
Ziv Bar-Yossef;Idit Keidar;Uri Schonfeld
Affiliations:
Technion and Google, Haifa, Israel;Technion, Haifa, Israel;UCLA, Los Angeles, CA
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
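
To make the idea of a URL-transforming rule concrete, the following is a minimal illustrative sketch, not the DustBuster algorithm itself. It assumes a rule is a simple substring substitution (replace one string with another), which the abstract does not specify; the names apply_rule and group_by_canonical_url, the example rules, and the example.com URLs are all hypothetical. The sketch only shows how already-discovered rules might be used to group URLs that likely share content; it does not mine rules from logs or verify them by sampling.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical rule form (an assumption): replace substring `alpha` with `beta`.
Rule = Tuple[str, str]


def apply_rule(url: str, rule: Rule) -> str:
    """Apply one substitution rule to the first occurrence of alpha in the URL."""
    alpha, beta = rule
    return url.replace(alpha, beta, 1)


def group_by_canonical_url(urls: Iterable[str], rules: List[Rule]) -> Dict[str, List[str]]:
    """Apply every rule in order to each URL and group URLs by the result.

    URLs that end up in the same group are candidates for having
    similar text; a crawler could fetch only one URL per group.
    """
    groups: Dict[str, List[str]] = {}
    for url in urls:
        canonical = url
        for rule in rules:
            canonical = apply_rule(canonical, rule)
        groups.setdefault(canonical, []).append(url)
    return groups


# Illustrative rules and URLs (hypothetical, not taken from the paper).
EXAMPLE_RULES: List[Rule] = [("/index.html", "/"), ("://www.", "://")]
EXAMPLE_URLS: List[str] = [
    "http://www.example.com/index.html",
    "http://example.com/",
    "http://example.com/news/index.html",
]
```

Calling group_by_canonical_url(EXAMPLE_URLS, EXAMPLE_RULES) places the first two URLs in one group and the third in a group of its own, so a crawler using these hypothetical rules would fetch two pages instead of three.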