Many web documents (such as Java FAQs) are replicated on the Internet, and entire document collections (such as hyperlinked Linux manuals) are often replicated many times. In this paper, we make the case for identifying replicated documents and collections in order to improve the web crawlers, archivers, and ranking functions used in search engines. We describe how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas in an input data set of several tens of millions of web pages and several hundred gigabytes of textual data. We also present two real-life case studies in which we used replication information to improve a crawler and a search engine. We report results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.