The presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance ranking, the core building blocks of web search. In this paper, we present a set of techniques for mining rules from URLs and using those rules to de-duplicate pages from URL strings alone, without explicitly fetching page content. Our approach mines crawl logs and uses clusters of similar pages to extract transformation rules, which normalize the URLs belonging to each cluster. Because the number of mined rules is large, retaining every rule for de-duplication is inefficient; we therefore present a machine learning technique that generalizes the rule set, reducing the resource footprint enough to be usable at web scale. The rule extraction techniques are robust to site-specific URL conventions. We compare the precision and scalability of our approach with recent URL-based de-duplication efforts. Experimental results show that our approach removes twice as many duplicates with only half as many rules as the most recent prior approach. We demonstrate the scalability of the framework through a large-scale evaluation on a set of 3 billion URLs, implemented on the MapReduce framework.
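To illustrate the core idea of URL normalization, the sketch below applies a per-host transformation rule to collapse duplicate URLs onto a single canonical form without fetching any content. The rule format and the `drop_params` rule shown here are illustrative assumptions for this example, not the paper's actual rule representation.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rule table: for a given host, drop query parameters that do
# not affect page content (e.g. session identifiers). This representation
# is an assumption made for illustration only.
RULES = {
    "example.com": {"drop_params": {"sessionid", "ref"}},
}

def normalize(url: str) -> str:
    """Apply the host's transformation rule to produce a canonical URL."""
    parts = urlsplit(url)
    rule = RULES.get(parts.netloc)
    if rule is None:
        return url  # no rule mined for this host; leave the URL unchanged
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in rule["drop_params"]]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

# Two URLs that differ only in a session parameter normalize to the same
# string, so they can be marked as duplicates from the URL alone.
a = normalize("http://example.com/item?id=7&sessionid=abc123")
b = normalize("http://example.com/item?id=7&sessionid=xyz789")
assert a == b == "http://example.com/item?id=7"
```

In the framework described above, such rules are mined per cluster of similar pages and then generalized, rather than hand-written as in this sketch.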