In this paper, we present a novel algorithm for measuring protein similarity based on 3-D structure (protein tertiary structure). The algorithm uses a suffix tree to discover common parts of the main chains of all proteins in the current Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB). From these common parts, we build a vector model and apply classical information retrieval (IR) algorithms based on that model to measure the similarity between proteins, that is, all-to-all protein similarity. To calculate protein similarity, we use the term frequency × inverse document frequency (tf × idf) term-weighting scheme and the cosine similarity measure. The goal of this paper is to introduce a new protein similarity metric based on suffix trees and IR methods. The entire current PDB database was used to demonstrate the algorithm's very good time complexity as well as its high precision. We chose the Structural Classification of Proteins (SCOP) database to verify the precision of our algorithm because it is maintained primarily by humans. A further result of this paper would be the ability to determine, with nearly 100% precision, the SCOP categories of proteins not included in the latest version of the SCOP database (v. 1.75).
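The vector-model step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the common main-chain parts have already been extracted (the suffix-tree step is omitted) and that each protein is represented as a list of fragment labels. The names `tfidf_vectors`, `cosine`, the protein identifiers, and the labels `f1` to `f6` are all hypothetical, and the idf variant log(N / df) is one common choice that the abstract does not specify.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        # Assumed weighting: tf * log(N / df); other idf variants exist.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: each protein is a list of shared-fragment labels.
proteins = {
    "P1": ["f1", "f2", "f2", "f3"],
    "P2": ["f1", "f2", "f4"],
    "P3": ["f5", "f6"],
}
names = list(proteins)
vecs = tfidf_vectors([proteins[n] for n in names])

# All-to-all similarity matrix, stored as a dict keyed by protein pairs.
sim = {
    (a, b): cosine(vecs[i], vecs[j])
    for i, a in enumerate(names)
    for j, b in enumerate(names)
}
```

In this toy example, `P1` and `P2` share the fragments `f1` and `f2`, so their similarity is strictly between 0 and 1, while `P3` shares nothing with either and scores 0. Sparse dictionaries are used because, in a collection the size of the PDB, each protein would be expected to contain only a small fraction of all discovered fragments.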