Finding similar files in a large file system

Authors:
Udi Manber
Affiliations:
Department of Computer Science, University of Arizona, Tucson, AZ
Venue:
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference
Year:
1994

Citing 4
Cited 111

On visual formalisms

Communications of the ACM
Detecting duplicates: a searcher's dream come true

Online
Fast text searching: allowing errors

Communications of the ACM
A theory of parameterized pattern matching: algorithms and applications

STOC '93 Proceedings of the twenty-fifth annual ACM symposium on Theory of computing

Using Copy-Detection and Text Comparison Algorithms for Cross-Referencing Multiple Editions of Literary Works

ECDL '01 Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries
Estimating Resemblance of MIDI Documents

ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
Identifying and Filtering Near-Duplicate Documents

CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
On finding duplication and near-duplication in large software systems

WCRE '95 Proceedings of the Second Working Conference on Reverse Engineering
PeerStore: Better Performance by Relaxing in Peer-to-Peer Backup

P2P '04 Proceedings of the Fourth International Conference on Peer-to-Peer Computing
Improving Bandwidth Efficiency of Peer-to-Peer Storage

P2P '04 Proceedings of the Fourth International Conference on Peer-to-Peer Computing
Fractal: A Mobile Code Based Framework for Dynamic Application Protocol Adaptation in Pervasive Computing

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code

IEEE Transactions on Software Engineering
A Dual-Method Model for Copy Detection

WI-IATW '06 Proceedings of the 2006 IEEE/WIC/ACM international conference on Web Intelligence and Intelligent Agent Technology
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Consistency-preserving caching of dynamic database content

Proceedings of the 16th international conference on World Wide Web
Improving mobile database access over wide-area networks without degrading consistency

Proceedings of the 5th international conference on Mobile systems, applications and services
Distributed text retrieval from overlapping collections

ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
Essential deduplication functions for transactional databases in law firms

Proceedings of the 11th international conference on Artificial intelligence and law
Multiple-signal duplicate detection for search evaluation

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Content-based document routing and index partitioning for scalable similarity-based searches in a large corpus

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Combinatorial algorithms for web search engines: three success stories

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Highly efficient techniques for network forensics

Proceedings of the 14th ACM conference on Computer and communications security
A dynamic birthmark for java

Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
Hyperspaces for object clustering and approximate matching in peer-to-peer overlays

HOTOS'07 Proceedings of the 11th USENIX workshop on Hot topics in operating systems
Supporting practical content-addressable caching with CZIP compression

ATC'07 Proceedings of the 2007 USENIX Annual Technical Conference
Avoiding the disk bottleneck in the data domain deduplication file system

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
SpotSigs: robust and efficient near duplicate detection in large web collections

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Packet caches on routers: the implications of universal redundant traffic elimination

Proceedings of the ACM SIGCOMM 2008 conference on Data communication
Nuisance level of a voice call

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Approximate object location and spam filtering on peer-to-peer systems

Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware
Sparse indexing: large scale, inline deduplication using sampling and locality

FAST '09 Proceedings of the 7th conference on File and storage technologies
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
Efficient overlap and content reuse detection in blogs and online news articles

Proceedings of the 18th international conference on World wide web
Comparison and evaluation of code clone detection techniques and tools: A qualitative approach

Science of Computer Programming
The design of a similarity based deduplication system

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Redundancy in network traffic: findings and implications

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Applying syntactic similarity algorithms for enterprise information management

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Harvesting Large-Scale Grids for Software Resources

CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Near-duplicate detection for web-forums

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Proceedings of the VLDB Endowment
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
New payload attribution methods for network forensic investigations

ACM Transactions on Information and System Security (TISSEC)
Experimental study of protocol-independent redundancy elimination algorithms

Proceedings of the first joint WOSP/SIPEW international conference on Performance engineering
On compressing the textual web

Proceedings of the third ACM international conference on Web search and data mining
Detecting visually similar Web pages: Application to phishing detection

ACM Transactions on Internet Technology (TOIT)
Using transparent compression to improve SSD-based I/O caches

Proceedings of the 5th European conference on Computer systems
Systems support for remote visualization of genomics applications over wide area networks

GCCB'06 Proceedings of the 2006 international conference on Distributed, high-performance and grid computing in computational biology
Density analysis of winnowing on non-uniform distributions

APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Differences and identities in document retrieval in an annotation environment

DNIS'07 Proceedings of the 5th international conference on Databases in networked information systems
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient similarity estimation for systems exploiting data redundancy

INFOCOM'10 Proceedings of the 29th conference on Information communications
HydraFS: a high-throughput file system for the HYDRAstor content-addressable storage system

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Wide-area network acceleration for the developing world

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Improving audio files availability in file sharing networks

WebMedia '09 Proceedings of the XV Brazilian Symposium on Multimedia and the Web
Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem

IEEE Transactions on Information Theory
Estimating set intersection using small samples

ACSC '10 Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102
Detection of simple plagiarism in computer science papers

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Large scale parallel document mining for machine translation

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Efficient non-linear editing for non-volatile mobile storage

Proceedings of the 2010 ACM multimedia workshop on Mobile cloud media computing
Automatic detection of local reuse

EC-TEL'10 Proceedings of the 5th European conference on Technology enhanced learning conference on Sustaining TEL: from innovation to learning and practice
Detecting and filtering instant messaging spam: a global and personalized approach

NPSEC'05 Proceedings of the First international conference on Secure network protocols
Facilitating interaction and retrieval for annotated documents

International Journal of Computational Science and Engineering
Efficient indexing of repeated n-grams

Proceedings of the fourth ACM international conference on Web search and data mining
Detecting near-duplicate relations in user generated forum content

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems
Tradeoffs in scalable data routing for deduplication clusters

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Federated Search

Foundations and Trends in Information Retrieval
A driver-layer caching policy for removable storage devices

ACM Transactions on Storage (TOS)
Venti: a new approach to archival storage

FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies
Integrating portable and distributed storage

FAST'04 Proceedings of the 3rd USENIX conference on File and storage technologies
Exploiting similarity for multi-source downloads using file handprints

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Schema mapping with quality assurance for data integration

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
Studying software evolution using artefacts' shared information content

Science of Computer Programming
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
SiLo: a similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
A preprocessing framework and approach for web applications

Journal of Web Engineering
Function clone detection in web applications: a semiautomated approach

Journal of Web Engineering
On the evolution of clusters of near-duplicate web pages

Journal of Web Engineering
Retrieving similar documents from the web

Journal of Web Engineering
SizeSpotSigs: an effective deduplicate algorithm considering the size of page content

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
The power of prediction: cloud bandwidth and cost reduction

Proceedings of the ACM SIGCOMM 2011 conference
Partial duplicate detection for large book collections

Proceedings of the 20th ACM international conference on Information and knowledge management
Mining relational structure from millions of books: position paper

Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing
DeFFS: Duplication-eliminated flash file system

Computers and Electrical Engineering
Compact features for detection of near-duplicates in distributed retrieval

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
The case of the duplicate documents: measurement, search, and science

APWeb'06 Proceedings of the 8th Asia-Pacific Web conference on Frontiers of WWW Research and Development
Using word clusters to detect similar web documents

KSEM'06 Proceedings of the First international conference on Knowledge Science, Engineering and Management
Temporal shingling for version identification in web archives

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval
A sentence-based copy detection approach for web documents

FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
String matching on the internet

CAAN'04 Proceedings of the First international conference on Combinatorial and Algorithmic Aspects of Networking
Measuring similarity of large software systems based on source code correspondence

PROFES'05 Proceedings of the 6th international conference on Product Focused Software Process Improvement
Clustering near-identical sequences for fast homology search

RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Transparent Online Storage Compression at the Block-Level

ACM Transactions on Storage (TOS)
Fast discovery of similar sequences in large genomic collections

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
WAN optimized replication of backup datasets using stream-informed delta compression

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
iDedup: latency-aware, inline data deduplication for primary storage

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Minersoft: Software retrieval in grid and cloud computing infrastructures

ACM Transactions on Internet Technology (TOIT)
Teleporter: An analytically and forensically sound duplicate transfer system

Digital Investigation: The International Journal of Digital Forensics & Incident Response
A system for the proactive, continuous, and efficient collection of digital forensic evidence

Digital Investigation: The International Journal of Digital Forensics & Incident Response
md5bloom: Forensic filesystem hashing revisited

Digital Investigation: The International Journal of Digital Forensics & Incident Response
Delta compressed and deduplicated storage using stream-informed locality

HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
WAN-optimized replication of backup datasets using stream-informed delta compression

ACM Transactions on Storage (TOS)
Measuring semantic relatedness using multilingual representations

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Experiments with filtered detection of similar academic papers

AIMSA'12 Proceedings of the 15th international conference on Artificial Intelligence: methodology, systems, and applications
Space savings and design considerations in variable length deduplication

ACM SIGOPS Operating Systems Review
Robust plagiary detection using semantic compression augmented SHAPD

ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part I
NIFTY: a system for large scale information flow tracking and clustering

Proceedings of the 22nd international conference on World Wide Web
Revision graph extraction in Wikipedia based on supergram decomposition

Proceedings of the 9th International Symposium on Open Collaboration
CoBAn: A context based model for data leakage prevention

Information Sciences: an International Journal
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System

IEEE/ACM Transactions on Networking (TON)
Memory efficient sanitization of a deduplicated storage system

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
File recipe compression in data deduplication systems

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
EsPRESSO: Efficient privacy-preserving evaluation of sample set similarity

Journal of Computer Security

Quantified Score

Hi-index	0.06

Abstract

We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. Finding all groups of similar files, even for as little as 25% similarity, proceeds at a rate on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all files similar to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.
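
The abstract does not say how sif decides that two files share "common pieces", so the following is only an illustrative sketch, not sif's actual algorithm. It assumes one common way to approximate the idea: hash every k-byte substring of a file, keep a content-determined sample of those hashes as fingerprints, and score two files by the fraction of the smaller fingerprint set that also appears in the other. The names `fingerprints` and `containment`, the window size `k`, and the `sample_bits` parameter are all hypothetical choices made for this example.

```python
import hashlib


def fingerprints(data: bytes, k: int = 8, sample_bits: int = 0) -> set:
    """Hash every k-byte substring of data and keep a sample of the hashes.

    A hash is kept only if its low `sample_bits` bits are zero, so the
    choice depends on the substring's content, not on its position.
    sample_bits=0 keeps every hash. (Illustrative assumption, not sif.)
    """
    mask = (1 << sample_bits) - 1
    kept = set()
    for i in range(len(data) - k + 1):
        digest = hashlib.blake2b(data[i:i + k], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h & mask == 0:
            kept.add(h)
    return kept


def containment(a: bytes, b: bytes, k: int = 8, sample_bits: int = 0) -> float:
    """Fraction of the smaller fingerprint set that is shared with the other."""
    fa = fingerprints(a, k, sample_bits)
    fb = fingerprints(b, k, sample_bits)
    if not fa or not fb:
        return 0.0  # too little data to compare
    return len(fa & fb) / min(len(fa), len(fb))


if __name__ == "__main__":
    original = b"The quick brown fox jumps over the lazy dog. " * 4
    embedded = b"HEADER\n" + original + b"\nFOOTER"
    # The original is contained verbatim in the larger file.
    print(containment(original, embedded))
    # Two files with no k-byte substring in common.
    print(containment(b"a" * 100, b"b" * 100))
```

Because a fingerprint depends only on the bytes of its substring, a file embedded inside a larger one keeps all of its fingerprints, which matches the abstract's example of one file being contained in another. A threshold such as the 25% mentioned above would then be applied to the score in a separate step.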