Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Inference of Reversible Languages
Journal of the ACM (JACM)
Inductive Inference: Theory and Methods
ACM Computing Surveys (CSUR)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Finding patterns common to a set of strings (Extended Abstract)
STOC '79 Proceedings of the eleventh annual ACM symposium on Theory of computing
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Robust Identification of Fuzzy Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Where and How Duplicates Occur in the Web
LA-WEB '06 Proceedings of the Fourth Latin American Web Congress
Pair-Wise entity resolution: overview and challenges
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Do not crawl in the dust: different urls with similar text
Proceedings of the 16th international conference on World Wide Web
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Adaptive graphical approach to entity resolution
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
URL normalization for de-duplication of web pages
Proceedings of the 18th ACM conference on Information and knowledge management
Learning URL patterns for webpage de-duplication
Proceedings of the third ACM international conference on Web search and data mining
Foundations and Trends in Information Retrieval
A pattern tree-based approach to learning URL normalization rules
Proceedings of the 19th international conference on World Wide Web
Learning website hierarchies for keyword enrichment in contextual advertising
Proceedings of the fourth ACM international conference on Web search and data mining
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
ACM Transactions on the Web (TWEB)
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Learning top-k transformation rules
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
FoCUS: learning to crawl web forums
Proceedings of the 21st international conference companion on World Wide Web
Crawling deep web entity pages
Proceedings of the sixth ACM international conference on Web search and data mining
A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content. In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that, under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm through extensive large-scale experiments.
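To make the idea of a URL rewrite rule concrete, here is a minimal sketch (not the paper's actual mining algorithm) of applying one learned rule: dropping a query parameter whose value varies across URLs of the same content equivalence class. The parameter name `sessionid` and the URLs are illustrative assumptions, standing in for whatever a rule-learning step would identify as content-irrelevant.

```python
# Toy rewrite rule: canonicalize a URL by removing query parameters that
# were found (hypothetically) to be irrelevant to page content, so that
# duplicate URLs collapse to one canonical form without any fetching.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed output of a rule-learning step over content equivalence classes.
IRRELEVANT_PARAMS = {"sessionid"}

def canonicalize(url: str) -> str:
    """Apply the rewrite rule: strip content-irrelevant query parameters."""
    scheme, netloc, path, query, frag = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in IRRELEVANT_PARAMS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

# Two URLs from the same equivalence class map to the same canonical form:
a = canonicalize("http://example.com/story?id=7&sessionid=abc123")
b = canonicalize("http://example.com/story?id=7&sessionid=xyz789")
assert a == b == "http://example.com/story?id=7"
```

In the paper's setting the set of rules is learned from the equivalence classes themselves rather than hand-specified, and the framework covers path rewrites as well as query-parameter drops; this sketch only shows how an already-learned rule eliminates duplicates among previously unseen URLs.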