Improved robustness of signature-based near-replica detection via lexicon randomization

Authors:
Aleksander Kołcz;Abdur Chowdhury;Joshua Alspector
Affiliations:
AOL, Inc., Dulles, VA;AOL, Inc., Dulles, VA;AOL, Inc., Dulles, VA
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004

Citing 14
Cited 32

Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Does WT10g look like the web?

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
A Novel Method for Detecting Similar Documents

HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02)-Volume 4 - Volume 4
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Engineering a multi-purpose test collection for web retrieval experiments

Information Processing and Management: an International Journal
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
"In vivo" spam filtering: a challenge problem for KDD

ACM SIGKDD Explorations Newsletter
The Smart/Empire TIPSTER IR system

TIPSTER '98 Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998

Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Distributed text retrieval from overlapping collections

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Lexicon randomization for near-duplicate detection with I-Match

The Journal of Supercomputing
Achieving both high precision and high recall in near-duplicate detection

Proceedings of the 17th ACM conference on Information and knowledge management
A comparison of techniques for estimating IDF values to generate lexical signatures for the web

Proceedings of the 10th ACM workshop on Web information and data management
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Frequent Itemset Mining for Clustering Near Duplicate Web Documents

ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies
Combinatorial Framework for Similarity Search

SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Proceedings of the VLDB Endowment
MapDupReducer: detecting near duplicates over massive datasets

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
Enhancing duplicate collection detection through replica boundary discovery

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Compact features for detection of near-duplicates in distributed retrieval

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Temporal shingling for version identification in web archives

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
Video fingerprinting using Latent Dirichlet Allocation and facial images

Pattern Recognition
A fusion of algorithms in near duplicate document detection

PAKDD'11 Proceedings of the 15th international conference on New Frontiers in Applied Data Mining
Clustering and load balancing optimization for redundant content removal

Proceedings of the 21st international conference companion on World Wide Web
Learning hash codes for efficient content reuse detection

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Cross-Language high similarity search using a conceptual thesaurus

CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining
Near-Duplicate detection for online-shops owners: an FCA-Based approach

ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Cache-conscious performance optimization for similarity search

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
A structure free self-adaptive piecewise hashing algorithm for spam filtering

Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service

Quantified Score

Hi-index	0.00

Visualization

Abstract

Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements.