Learning URL patterns for webpage de-duplication

Authors:
Hema Swetha Koppula;Krishna P. Leela;Amit Agarwal;Krishna Prasad Chitrapura;Sachin Garg;Amit Sasturkar
Affiliations:
Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Picsquare.com, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Inc., Sunnyvale, CA, USA
Venue:
Proceedings of the third ACM international conference on Web search and data mining
Year:
2010

Citing 14
Cited 11

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Induction of Decision Trees

Machine Learning
Adaptive on-line page importance computation

WWW '03 Proceedings of the 12th international conference on World Wide Web
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Fast webpage classification using URL features

Proceedings of the 14th ACM international conference on Information and knowledge management
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Do not crawl in the dust: different urls with similar text

Proceedings of the 16th international conference on World Wide Web
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
De-duping URLs via rewrite rules

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Purely URL-based topic classification

Proceedings of the 18th international conference on World wide web
URL normalization for de-duplication of web pages

Proceedings of the 18th ACM conference on Information and knowledge management

Relevance-index size tradeoff in contextual advertising

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Learning website hierarchies for keyword enrichment in contextual advertising

Proceedings of the fourth ACM international conference on Web search and data mining
The missing links: discovering hidden same-as links among a billion of triples

Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

ACM Transactions on the Web (TWEB)
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
FoCUS: learning to crawl web forums

Proceedings of the 21st international conference companion on World Wide Web
Crawling deep web entity pages

Proceedings of the sixth ACM international conference on Web search and data mining
A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

ACM Transactions on the Web (TWEB)
Researcher homepage classification using unlabeled data

Proceedings of the 22nd international conference on World Wide Web
A pattern-based selective recrawling approach for object-level vertical search

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
CALA: An unsupervised URL-based web page classification system

Knowledge-Based Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these rules for de-duplication using just URL strings without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract transformation rules, which are used to normalize URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient due to the large number of such rules. We present a machine learning technique to generalize the set of rules, which reduces the resource footprint to be usable at web-scale. The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves 2 times more reduction in duplicates with only half the rules compared to the most recent previous approach. Scalability of the framework is demonstrated by performing a large scale evaluation on a set of 3 Billion URLs, implemented using the MapReduce framework.