A theory of parameterized pattern matching: algorithms and applications
STOC '93 Proceedings of the twenty-fifth annual ACM symposium on Theory of computing
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Dictionary look-up with one error
Journal of Algorithms
Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Trawling the Web for emerging cyber-communities
WWW '99 Proceedings of the eighth international conference on World Wide Web
Mirror, mirror on the Web: a study of host pairs with replicated content
WWW '99 Proceedings of the eighth international conference on World Wide Web
Neighborhood preserving hashing and approximate queries
SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
Selectively estimation for Boolean queries
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A comparison of techniques to find mirrored hosts on the WWW
Journal of the American Society for Information Science
Improved bounds for dictionary look-up with one error
Information Processing Letters
Efficient and tumble similar set retrieval
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
ACM Transactions on Internet Technology (TOIT)
Evaluating topic-driven web crawlers
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web
Detecting similar documents using salient terms
Proceedings of the eleventh international conference on Information and knowledge management
Mining the Web: Discovering Knowledge from HyperText Data
Mining the Web: Discovering Knowledge from HyperText Data
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Dictionary Look-Up within Small Edit Distance
COCOON '02 Proceedings of the 8th Annual International Conference on Computing and Combinatorics
Approximate Dictionary Queries
CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching
Methods for identifying versioned and plagiarized documents
Journal of the American Society for Information Science and Technology
Efficient URL caching for world wide web crawling
WWW '03 Proceedings of the 12th international conference on World Wide Web
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
On finding duplication and near-duplication in large software systems
WCRE '95 Proceedings of the Second Working Conference on Reverse Engineering
Finding Interesting Associations without Support Pruning
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A bag of paths model for measuring structural similarity in Web documents
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Constructing a text corpus for inexact duplicate detection
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Improved robustness of signature-based near-replica detection via lexicon randomization
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient Phrase-Based Document Indexing for Web Document Clustering
IEEE Transactions on Knowledge and Data Engineering
WWW '05 Proceedings of the 14th international conference on World Wide Web
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
On searching compressed string collections cache-obliviously
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Exploring traversal strategy for web forum crawling
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
SpotSigs: robust and efficient near duplicate detection in large web collections
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
De-duping URLs via rewrite rules
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Achieving both high precision and high recall in near-duplicate detection
Proceedings of the 17th ACM conference on Information and knowledge management
Efficient overlap and content reuse detection in blogs and online news articles
Proceedings of the 18th international conference on World wide web
Sitemaps: above and beyond the crawl of duty
Proceedings of the 18th international conference on World wide web
IRLbot: Scaling to 6 billion pages and beyond
ACM Transactions on the Web (TWEB)
Leveraging discarded samples for tighter estimation of multiple-set aggregates
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Automatic video tagging using content redundancy
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Automatic retrieval of similar content using search engine query interface
Proceedings of the 18th ACM conference on Information and knowledge management
URL normalization for de-duplication of web pages
Proceedings of the 18th ACM conference on Information and knowledge management
Coordinated weighted sampling for estimating aggregates over multiple weight assignments
Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Learning URL patterns for webpage de-duplication
Proceedings of the third ACM international conference on Web search and data mining
A pattern tree-based approach to learning URL normalization rules
Proceedings of the 19th international conference on World wide web
Proceedings of the 19th international conference on World wide web
Detecting near-duplicates in large-scale short text databases
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Similarity search and locality sensitive hashing using ternary content addressable memories
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Adaptive near-duplicate detection via similarity learning
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Similarity joins as stronger metric operations
SIGSPATIAL Special
A locality-sensitive hash for real vectors
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Simple and efficient algorithm for approximate dictionary matching
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
XML structural similarity search using mapreduce
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Fixing the threshold for effective detection of near duplicate web documents in web crawling
ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications: Part I
Exponential time improvement for min-wise based algorithms
Information and Computation
PRESIDIO: A Framework for Efficient Archival Data Storage
ACM Transactions on Storage (TOS)
Content redundancy in YouTube and its application to video tagging
ACM Transactions on Information Systems (TOIS)
Efficient similarity joins for near-duplicate detection
ACM Transactions on Database Systems (TODS)
Semi-supervised SimHash for efficient document similarity search
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Hypergeometric language models for republished article finding
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Fast locality-sensitive hashing
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
A strategy for efficient crawling of rich internet applications
ICWE'11 Proceedings of the 11th international conference on Web engineering
Probabilistic near-duplicate detection using simhash
Proceedings of the 20th ACM international conference on Information and knowledge management
CoDet: sentence-based containment detection in news corpora
Proceedings of the 20th ACM international conference on Information and knowledge management
Detection of near-duplicate user generated contents: the SMS spam collection
Proceedings of the 3rd international workshop on Search and mining user-generated contents
A simhash-based scheme for locating product information from the web
Proceedings of the Second Symposium on Information and Communication Technology
Temporal shingling for version identification in web archives
ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Exponential time improvement for min-wise based algorithms
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Bayesian locality sensitive hashing for fast similarity search
Proceedings of the VLDB Endowment
Efficient semantic-aware detection of near duplicate resources
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II
A fusion of algorithms in near duplicate document detection
PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
FoCUS: learning to crawl web forums
Proceedings of the 21st international conference companion on World Wide Web
Clustering and load balancing optimization for redundant content removal
Proceedings of the 21st international conference companion on World Wide Web
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
On generating large-scale ground truth datasets for the deduplication of bibliographic records
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
High-confidence near-duplicate image detection
Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies
Fast near neighbor search in high-dimensional binary data
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Detecting near-duplicate documents using sentence-level features and supervised learning
Expert Systems with Applications: An International Journal
Crawling deep web entity pages
Proceedings of the sixth ACM international conference on Web search and data mining
HmSearch: an efficient hamming distance query processing algorithm
Proceedings of the 25th International Conference on Scientific and Statistical Database Management
Groundhog day: near-duplicate detection on Twitter
Proceedings of the 22nd international conference on World Wide Web
Bottom-k and priority sampling, set similarity and subset sums with minimal independence
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Revision graph extraction in Wikipedia based on supergram decomposition
Proceedings of the 9th International Symposium on Open Collaboration
Near duplicate detection in an academic digital library
Proceedings of the 2013 ACM symposium on Document engineering
The fingerprint analysis technique-oriented research on microblog for public opinion analysis
Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Locality sensitive hashing for scalable structural classification and clustering of web documents
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
A pattern-based selective recrawling approach for object-level vertical search
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Efficient filtering and ranking schemes for finding inclusion dependencies on the web
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies
b-bit minwise hashing in practice
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Efficient top-k retrieval with signatures
Proceedings of the 18th Australasian Document Computing Symposium
Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
Proceedings of the VLDB Endowment
Efficient estimation for high similarities using odd sketches
Proceedings of the 23rd international conference on World wide web
Hi-index | 0.02 |
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.