We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise. For example, one file may be contained, possibly with some changes, in another file, or a file may be a reorganization of another file. Finding all groups of similar files, even for as little as 25% similarity, proceeds at a rate on the order of 500MB to 1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at a post-processing stage, which is very fast. Sif can also be used to very quickly identify all files similar to a query file using a preprocessed index. Applications of sif can be found in file management, information collecting (to remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.
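The abstract does not say how sif decides that two files share "common pieces", so the following is only an illustrative sketch of one common approach, not a description of sif itself. It assumes that each file is summarized by a sampled set of hashes of its k-byte substrings, and that similarity is the fraction of one file's sampled hashes that also appear in the other. All names here (`fingerprints`, `containment`, `similar_pairs`, `keep_mask`) are hypothetical, and the hash function and window size are arbitrary choices.

```python
import hashlib
from itertools import combinations


def fingerprints(data: bytes, k: int = 16, keep_mask: int = 0x0F) -> set[int]:
    """Hash every k-byte window of data and keep a deterministic sample.

    A window is kept when the low bits of its hash selected by keep_mask
    are all zero, so the same piece is kept (or dropped) in every file.
    keep_mask=0 keeps every window. (Assumed scheme, not taken from sif.)
    """
    selected = set()
    for i in range(len(data) - k + 1):
        digest = hashlib.blake2b(data[i:i + k], digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value & keep_mask == 0:
            selected.add(value)
    return selected


def containment(a: set[int], b: set[int]) -> float:
    """Fraction of a's fingerprints that also occur in b (0.0 if a is empty)."""
    return len(a & b) / len(a) if a else 0.0


def similar_pairs(files: dict[str, bytes], threshold: float = 0.25,
                  k: int = 16, keep_mask: int = 0x0F):
    """Return (name, name, score) for every pair scoring at least threshold.

    The score is the larger of the two containment values, so a small file
    embedded in a large one still counts as similar.
    """
    prints = {name: fingerprints(content, k, keep_mask)
              for name, content in files.items()}
    result = []
    for x, y in combinations(sorted(prints), 2):
        score = max(containment(prints[x], prints[y]),
                    containment(prints[y], prints[x]))
        if score >= threshold:
            result.append((x, y, score))
    return result
```

In this sketch the fingerprint sets are computed once and the threshold is applied only when pairs are compared, which loosely mirrors the idea of choosing the amount of similarity at a post-processing stage. Comparing all pairs is quadratic in the number of files; a tool meant for a whole file system would presumably need an index from fingerprints to files instead.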