Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
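
As a rough illustration of the pipeline described above (chunk the byte stream, hash each chunk into a signature, link files that share signatures, read off clusters), here is a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the chunking algorithm, so the boundary test (a hash of the trailing `WINDOW` bytes), the constants `WINDOW`, `MASK`, and `MIN_CHUNK`, the function names, and the use of connected components to form clusters are all assumptions made for illustration. The optional bipartite graph partitioning step is omitted.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

# All constants are illustrative assumptions, not values from the paper.
WINDOW = 16           # trailing bytes that decide whether a boundary falls here
MASK = (1 << 6) - 1   # a boundary fires on roughly 1 in 64 positions
MIN_CHUNK = 16        # must be >= WINDOW so the window never crosses a boundary


def split_chunks(data: bytes) -> List[bytes]:
    """Split data at content-defined boundaries.

    A boundary is declared after byte i when a hash of the last WINDOW
    bytes has its low bits equal to zero. Because the decision depends
    only on local content, an insertion early in a file shifts later
    boundaries along with the content instead of changing every chunk.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < MIN_CHUNK:
            continue
        window = data[i + 1 - WINDOW:i + 1]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def signatures(data: bytes) -> Set[str]:
    """Return the set of chunk signatures (SHA-1 hex digests) for one file."""
    return {hashlib.sha1(c).hexdigest() for c in split_chunks(data)}


def cluster_files(files: Dict[str, bytes], min_shared: int = 1) -> List[Set[str]]:
    """Group files that share at least min_shared chunk signatures.

    Builds the file-chunk graph as a chunk -> files map, counts shared
    chunks per file pair, then merges qualifying pairs with union-find.
    Only clusters containing two or more files are returned.
    """
    chunk_to_files = defaultdict(set)
    for name, data in files.items():
        for sig in signatures(data):
            chunk_to_files[sig].add(name)

    shared = defaultdict(int)
    for names in chunk_to_files.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    parent = {name: name for name in files}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return [c for c in clusters.values() if len(c) > 1]
```

Calling `cluster_files` with a mapping of file names to byte contents returns the groups of files that share chunks. Raising `min_shared` is one simple way to ignore files that overlap only in a small common fragment such as a boilerplate header. The pairwise counting here is quadratic in the number of files sharing a chunk, so a sketch like this would not scale to the repository sizes the abstract describes.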