CoDet: sentence-based containment detection in news corpora

Authors:
Emre Varol;Fazli Can;Cevdet Aykanat;Oguz Kaya
Affiliations:
Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey;Bilkent University, Ankara, Turkey
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Citing 17
Cited 0

Copy detection mechanisms for digital documents

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
How reliable are the results of large-scale information retrieval experiments?

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Efficiency of data structures for detecting overlaps in digital documents

ACSC '01 Proceedings of the 24th Australasian conference on Computer science
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Managing déjà vu: Collection building for the identification of nonidentical duplicate documents

Journal of the American Society for Information Science and Technology - Research Articles
Approximately detecting duplicates for streaming data using stable bloom filters

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
New event detection and topic tracking in Turkish

Journal of the American Society for Information Science and Technology
Adaptive near-duplicate detection via similarity learning

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
The case of the duplicate documents measurement, search, and science

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development

Quantified Score

Hi-index	0.00

Visualization

Abstract

We study a generalized version of the near-duplicate detection problem which concerns whether a document is a subset of another document. In text-based applications, document containment can be observed in exact-duplicates, near-duplicates, or containments, where the first two are special cases of the third. We introduce a novel method, called CoDet, which focuses particularly on this problem, and compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) that are adapted to containment detection. Our method is expandable to different domains, and especially suitable for streaming news. Experimental results show that CoDet effectively and efficiently produces remarkable results in detecting containments.