Lexicon randomization for near-duplicate detection with I-Match

Authors:
A. Kołcz;A. Chowdhury
Affiliations:
Microsoft Live Labs, Redmonnd, USA 98052;University of Bremen TZI, Bremen, Germany
Venue:
The Journal of Supercomputing
Year:
2008

Citing 20
Cited 4

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Bagging predictors

Machine Learning
Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Does WT10g look like the web?

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Methods for identifying versioned and plagiarized documents

Journal of the American Society for Information Science and Technology
Bursty and hierarchical structure in streams

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Novel Method for Detecting Similar Documents

HICSS '02 Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02)-Volume 4 - Volume 4
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
On the Evolution of Clusters of Near-Duplicate Web Pages

LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Engineering a multi-purpose test collection for web retrieval experiments

Information Processing and Management: an International Journal
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
"In vivo" spam filtering: a challenge problem for KDD

ACM SIGKDD Explorations Newsletter
Improved robustness of signature-based near-replica detection via lexicon randomization

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Support vector machines for spam categorization

IEEE Transactions on Neural Networks

Email Spam Filtering: A Systematic Review

Foundations and Trends in Information Retrieval
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Detection of near-duplicate user generated contents: the SMS spam collection

Proceedings of the 3rd international workshop on Search and mining user-generated contents
A structure free self-adaptive piecewise hashing algorithm for spam filtering

Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service

Quantified Score

Hi-index	0.00

Visualization

Abstract

Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional techniques relying on direct inter-document similarity computation are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, while very attractive computationally, can be unstable even to small perturbations of document content, which causes signature fragmentation. We focus on I-Match and present a randomization-based technique of increasing its signature stability, with the proposed method consistently outperforming traditional I-Match by as high as 40---60% in terms of the relative improvement in near-duplicate recall. Importantly, the large gains in detection accuracy are offset by only small increases in computational requirements. We also address the complimentary problem of spurious matches, which is particularly important for I-Match when fingerprinting long documents. Our discussion is supported by experiments involving large web-page and email datasets.