Finding replicated Web collections

Authors:
Junghoo Cho;Narayanan Shivakumar;Hector Garcia-Molina
Affiliations:
Department of Computer Science, Stanford, CA;Department of Computer Science, Stanford, CA;Department of Computer Science, Stanford, CA
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000

Abstract

Many web documents (such as Java FAQs) are replicated on the Internet. Often, entire document collections (such as hyperlinked Linux manuals) are replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and the ranking functions used in search engines. We describe how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas in an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies in which we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.