We describe a system for rapidly determining document similarity among a set of documents returned by an information retrieval (IR) system. A rapid phrase recognizer produces a ranked list of the most important terms in each document. We store these terms in a database and compute document similarity with a simple database query. If the number of terms that the two documents do not share is below a predetermined threshold relative to the total number of terms in the document, the two documents are judged to be very similar. We compare this method with the shingles approach.
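The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the table name `doc_terms`, the function names, the 0.25 threshold, and the choice to normalize by the number of distinct terms across both documents are all hypothetical, since the abstract does not specify them. The term lists are assumed to come from some phrase recognizer that is not shown. A small word-shingle resemblance function is included for contrast, using the usual definition of shingles as contiguous runs of `w` words compared by set overlap.

```python
import sqlite3


def build_db(doc_terms):
    """Store each document's important terms in an in-memory table.

    doc_terms maps a document id to an iterable of terms (assumed to be
    the output of a phrase recognizer, which is not shown here).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_terms (doc_id TEXT, term TEXT, "
        "PRIMARY KEY (doc_id, term))"
    )
    rows = [(d, t) for d, terms in doc_terms.items() for t in terms]
    conn.executemany("INSERT OR IGNORE INTO doc_terms VALUES (?, ?)", rows)
    return conn


def unshared_fraction(conn, doc_a, doc_b):
    """Fraction of distinct terms that appear in only one of the two documents.

    Assumption: the denominator is the number of distinct terms across
    both documents; the abstract leaves the exact normalization open.
    """
    count_sql = "SELECT COUNT(*) FROM doc_terms WHERE doc_id = ?"
    total_a = conn.execute(count_sql, (doc_a,)).fetchone()[0]
    total_b = conn.execute(count_sql, (doc_b,)).fetchone()[0]
    # A self-join on the term column counts the terms both documents contain.
    shared = conn.execute(
        "SELECT COUNT(*) FROM doc_terms a JOIN doc_terms b ON a.term = b.term "
        "WHERE a.doc_id = ? AND b.doc_id = ?",
        (doc_a, doc_b),
    ).fetchone()[0]
    union = total_a + total_b - shared
    if union == 0:
        return 0.0
    return (union - shared) / union


def is_very_similar(conn, doc_a, doc_b, threshold=0.25):
    """Apply a hypothetical threshold to the unshared-term fraction."""
    return unshared_fraction(conn, doc_a, doc_b) < threshold


def shingle_resemblance(text_a, text_b, w=4):
    """Overlap of the two texts' sets of w-word shingles (intersection / union)."""
    def shingles(text):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    sa, sb = shingles(text_a), shingles(text_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    conn = build_db({
        "a": ["query expansion", "document similarity", "phrase recognizer", "database query"],
        "b": ["query expansion", "document similarity", "phrase recognizer", "shingles"],
    })
    print(unshared_fraction(conn, "a", "b"), is_very_similar(conn, "a", "b"))
```

The term-based check needs only the stored term lists and one join per document pair, whereas the shingle comparison works on the full word sequence of each text.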