Detecting similar documents using salient terms

Authors:
James W. Cooper; Anni R. Coden; Eric W. Brown
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY (all authors)
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002



Abstract

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these terms in a database and compute document similarity using a simple database query. If the number of terms not contained in both documents is below a predetermined threshold relative to the total number of terms in the document, the documents are judged to be very similar. We compare this method to the shingles approach.
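The database step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table name `doc_terms`, the functions `build_index` and `similar_pairs`, the sample terms, and the threshold value are all hypothetical. The salient terms are assumed to have been extracted already (the phrase recognizer is not modeled), and the mismatch measure, the number of terms not shared by both documents divided by the combined term count, is one possible reading of the criterion in the abstract.

```python
import sqlite3


def build_index(docs):
    """Store each document's salient terms, one (doc_id, term) row per term."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, "
        "PRIMARY KEY (doc_id, term))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO doc_terms VALUES (?, ?)",
        [(doc_id, term.lower()) for doc_id, terms in docs.items() for term in terms],
    )
    return conn


def similar_pairs(conn, threshold):
    """Return (doc_a, doc_b, mismatch) for pairs whose mismatch is below threshold.

    Assumed measure: mismatch = (|A| + |B| - 2 * |A and B|) / (|A| + |B|),
    i.e. the share of terms that do not appear in both documents.
    Pairs with no shared term never appear (their mismatch would be 1.0).
    """
    query = """
        WITH counts AS (
            SELECT doc_id, COUNT(*) AS n FROM doc_terms GROUP BY doc_id
        ),
        shared AS (
            SELECT a.doc_id AS doc_a, b.doc_id AS doc_b, COUNT(*) AS n_shared
            FROM doc_terms a
            JOIN doc_terms b ON a.term = b.term AND a.doc_id < b.doc_id
            GROUP BY a.doc_id, b.doc_id
        )
        SELECT s.doc_a, s.doc_b,
               (ca.n + cb.n - 2.0 * s.n_shared) / (ca.n + cb.n) AS mismatch
        FROM shared s
        JOIN counts ca ON ca.doc_id = s.doc_a
        JOIN counts cb ON cb.doc_id = s.doc_b
        WHERE (ca.n + cb.n - 2.0 * s.n_shared) / (ca.n + cb.n) < ?
        ORDER BY mismatch, s.doc_a, s.doc_b
    """
    return conn.execute(query, (threshold,)).fetchall()


# Hypothetical salient terms for three documents.
docs = {
    "d1": ["IBM", "information retrieval", "phrase recognizer",
           "document similarity", "database"],
    "d2": ["IBM", "information retrieval", "phrase recognizer",
           "document similarity"],
    "d3": ["World Wide Web", "database", "crawler"],
}
conn = build_index(docs)
pairs = similar_pairs(conn, threshold=0.2)  # threshold chosen for illustration only
```

Here `d1` and `d2` differ by a single term and fall under the illustrative threshold, while `d3` shares at most one term with the others and does not. Keeping one row per (document, term) lets a single self-join count the shared terms for every pair at once, which is what makes a plain database query sufficient for the comparison.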