Learning URL patterns for webpage de-duplication

  • Authors:
  • Hema Swetha Koppula;Krishna P. Leela;Amit Agarwal;Krishna Prasad Chitrapura;Sachin Garg;Amit Sasturkar

  • Affiliations:
  • Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Picsquare.com, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Inc., Sunnyvale, CA, USA

  • Venue:
  • Proceedings of the third ACM international conference on Web search and data mining
  • Year:
  • 2010

Abstract

The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication based on URL strings alone, without fetching the content. Our technique mines the crawl logs and uses clusters of similar pages to extract transformation rules, which are then used to normalize the URLs belonging to each cluster. Retaining every mined rule for de-duplication is inefficient because of the large number of such rules. We therefore present a machine learning technique that generalizes the set of rules, reducing the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves twice the reduction in duplicates with only half the rules compared to the most recent previous approach. We demonstrate the scalability of the framework with a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework.
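
The abstract does not spell out the rule format or the generalization step, so the following is only a minimal sketch of the general idea of normalizing URLs with rules mined from duplicate clusters. It assumes a much simpler rule type than the paper is likely to use: a query parameter is treated as irrelevant for a host if its value varies inside a cluster of pages known to have the same content. The function names `mine_rules` and `normalize` and the `example.com` URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def mine_rules(clusters):
    """Map each host to the query parameters that appear not to affect content.

    Assumption: every cluster is a list of URLs already known to be duplicates
    (for example, grouped by a content fingerprint from an earlier crawl).
    """
    rules = defaultdict(set)
    for cluster in clusters:
        parsed = [urlsplit(url) for url in cluster]
        hosts = {p.netloc for p in parsed}
        paths = {p.path for p in parsed}
        # This sketch only handles clusters that share one host and one path.
        if len(hosts) != 1 or len(paths) != 1:
            continue
        host = next(iter(hosts))
        queries = [dict(parse_qsl(p.query, keep_blank_values=True)) for p in parsed]
        for key in set().union(*queries):
            # A parameter whose value changes (or is sometimes absent) while
            # the content stays the same is assumed to be irrelevant.
            if len({q.get(key) for q in queries}) > 1:
                rules[host].add(key)
    return dict(rules)


def normalize(url, rules):
    """Drop irrelevant parameters and sort the rest to get a canonical URL."""
    parts = urlsplit(url)
    drop = rules.get(parts.netloc, set())
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in drop
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


clusters = [
    [
        "http://example.com/item?id=7&sid=abc",
        "http://example.com/item?id=7&sid=xyz",
    ]
]
rules = mine_rules(clusters)
canonical = normalize("http://example.com/item?id=9&sid=q1", rules)
```

Two URLs that normalize to the same string can then be treated as duplicates without fetching either page. Storing one set of parameters per host is a crude stand-in for the paper's learned generalization of rules, which the abstract describes only at a high level.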