Do not crawl in the DUST: different URLs with similar text
Proceedings of the 15th international conference on World Wide Web
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent on web sites, since web server software often uses aliases and redirections, and may dynamically generate the same page in response to many different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents, and verifying the resulting rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
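To make the idea concrete, here is a minimal sketch (not the paper's implementation) of how DUST rules of the form "replace substring a with substring b" can be applied to canonicalize URLs, so that DUST aliases collapse to a single key. The rules and URLs below are hypothetical examples of the kind of output a DustBuster-style miner might produce.

```python
# Hypothetical substring-substitution rules a DUST miner might output.
RULES = [
    ("/index.html", "/"),        # trailing index page is an alias for the directory
    ("http://www.", "http://"),  # "www." host prefix is an alias
]

def canonicalize(url: str, rules=RULES) -> str:
    """Apply each substring-substitution rule once, in order."""
    for src, dst in rules:
        if src in url:
            url = url.replace(src, dst)
    return url

def dedup(urls):
    """Keep one representative URL per canonical form."""
    seen = {}
    for u in urls:
        seen.setdefault(canonicalize(u), u)
    return list(seen.values())

urls = [
    "http://www.example.com/index.html",
    "http://example.com/",
    "http://example.com/page?id=1",
]
print(dedup(urls))  # the first two URLs collapse to one canonical form
```

In the paper's setting, candidate rules are mined from URL logs alone and then validated by fetching a small sample of page pairs to check that their content really is similar; only validated rules would be used for canonicalization as above.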