Learning URL patterns for webpage de-duplication
Proceedings of the third ACM international conference on Web search and data mining
The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance, the core building blocks of web search. In this paper, we present a set of techniques that mine rules from URLs and use these learned rules for de-duplication from URL strings alone, without explicitly fetching page content. Our technique mines crawl logs and uses clusters of similar pages to extract specific rules from the URLs belonging to each cluster. Retaining every mined rule for de-duplication is inefficient because the number of specific rules is large; we therefore present a machine learning technique that generalizes the rule set, reducing the resource footprint enough to be usable at web scale. The rule extraction techniques are robust to website-specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.
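To make the pipeline concrete, here is a minimal, hypothetical Python sketch of the simplest rule family such a system might mine: query parameters whose values vary freely within a cluster of known duplicates, and which a normalization rule can therefore drop. This is not the paper's actual algorithm; it assumes the clustering step has already been done (e.g., by content fingerprinting of crawl logs), and the function names are illustrative.

    from urllib.parse import urlparse, parse_qsl, urlencode

    def irrelevant_query_keys(cluster):
        """Given URLs known to point to duplicate content, return query
        parameters whose values vary within the cluster; since the content
        is identical, those values are irrelevant and a rule can drop them."""
        key_values = {}
        for url in cluster:
            for key, value in parse_qsl(urlparse(url).query):
                key_values.setdefault(key, set()).add(value)
        return {k for k, vals in key_values.items() if len(vals) > 1}

    def normalize(url, drop_keys):
        """Apply the mined rule: strip irrelevant parameters so that
        duplicate URLs collapse to a single canonical string."""
        parts = urlparse(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop_keys]
        return parts._replace(query=urlencode(kept)).geturl()

    cluster = [
        "http://example.com/story?id=42&sessionid=abc",
        "http://example.com/story?id=42&sessionid=xyz",
    ]
    drop = irrelevant_query_keys(cluster)  # {'sessionid'}
    print(normalize(cluster[0], drop))     # http://example.com/story?id=42

A real system would mine many such site-specific rules and then, as the abstract describes, generalize them (for example, recognizing session-id-like parameters across many hosts) so that the stored rule set remains small enough for web-scale de-duplication.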