Finding similar files in large document repositories

Authors:
George Forman;Kave Eshghi;Stephane Chiocchetti
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard Labs, Palo Alto, CA;Hewlett-Packard France, Courtaboeuf, France
Venue:
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Year:
2005

Citing 4
Cited 15

Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web

A low-bandwidth network file system
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles

Signature extraction for overlap detection in documents
ACSC '02 Proceedings of the twenty-fifth Australasian conference on Computer science - Volume 4

Exploring patterns of social commonality among file directories at work
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Efficient detection of large-scale redundancy in enterprise file systems
ACM SIGOPS Operating Systems Review

Passages through time: chronicling users' information interaction history by recording when and what they read
Proceedings of the 14th international conference on Intelligent user interfaces

Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proceedings of the 7th conference on File and storage technologies

Applying syntactic similarity algorithms for enterprise information management
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Density analysis of winnowing on non-uniform distributions
APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management

Bimodal content defined chunking for backup streams
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies

A running time improvement for the two thresholds two divisors algorithm
Proceedings of the 48th Annual Southeast Regional Conference

Efficient exact edit similarity query processing with the asymmetric signature scheme
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Building a high-performance deduplication system
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference

SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference

Retrieving similar discussion forum threads: a structure based approach
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Experiments with filtered detection of similar academic papers
AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications

Asymmetric signature schemes for efficient exact edit similarity query processing
ACM Transactions on Database Systems (TODS)

Abstract

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
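
The pipeline outlined in the abstract (chunk the byte stream, hash each chunk into a signature, link files that share signatures, read clusters off the resulting graph) can be illustrated with a minimal Python sketch. The abstract does not specify the chunking algorithm, the hash function, or the similarity criterion, so everything concrete here is an assumption made for illustration: the CRC-32 sliding-window boundary test, the SHA-1 signatures, the 0.5 shared-byte threshold, and the names `chunk_bytes` and `cluster_similar_files` are hypothetical and not taken from the paper. The optional bipartite partitioning step is not shown.

```python
import hashlib
import zlib
from collections import defaultdict
from typing import Dict, List


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F,
                min_size: int = 32) -> List[bytes]:
    """Split data at content-defined boundaries (illustrative scheme only).

    A boundary is declared after position i when the CRC-32 of the last
    `window` bytes has its low bits all zero, so boundaries depend on local
    content rather than on absolute file offsets.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if i - start < min_size:
            continue  # enforce a minimum chunk length
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def cluster_similar_files(files: Dict[str, bytes],
                          threshold: float = 0.5) -> List[List[str]]:
    """Group files whose shared chunk bytes cover at least `threshold`
    of the smaller file (an assumed similarity criterion)."""
    # Bipartite file-chunk graph: signature -> files, and file -> {signature: size}.
    chunk_files = defaultdict(set)
    file_chunks = {}
    for name, data in files.items():
        sizes = {}
        for chunk in chunk_bytes(data):
            sig = hashlib.sha1(chunk).digest()
            sizes[sig] = len(chunk)
            chunk_files[sig].add(name)
        file_chunks[name] = sizes

    # Accumulate the number of bytes shared by each pair of files.
    shared = defaultdict(int)
    for sig, names in chunk_files.items():
        ordered = sorted(names)
        for idx, a in enumerate(ordered):
            for b in ordered[idx + 1:]:
                shared[(a, b)] += file_chunks[a][sig]

    # Union-find over files; join pairs that pass the threshold.
    parent = {name: name for name in files}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), nbytes in shared.items():
        smaller = min(len(files[a]), len(files[b]))
        if smaller and nbytes / smaller >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for name in files:
        clusters[find(name)].append(name)
    return [sorted(members) for members in clusters.values() if len(members) > 1]
```

Calling `cluster_similar_files({"v1.txt": old_bytes, "v2.txt": new_bytes, ...})` returns lists of file names that share enough content to be grouped. Because chunk boundaries are chosen from local content, chunks that end before a small edit are unchanged, and chunks after it typically realign with the original, so most signatures still match. The pairwise loop over files sharing a chunk is quadratic in the worst case and is kept only for brevity.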