Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
Managing gigabytes (2nd ed.): compressing and indexing documents and images
Managing gigabytes (2nd ed.): compressing and indexing documents and images
STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
On the Evolution of Clusters of Near-Duplicate Web Pages
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Improved Gapped Alignment in BLAST
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Redundant documents and search effectiveness
Proceedings of the 14th ACM international conference on Information and knowledge management
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Clustering near-identical sequences for fast homology search
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Progress in information retrieval
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Hi-index | 0.00 |
Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible — either in space or in time — as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and executes in a time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.