Species identification based on approximate matching

Authors:
Nagamma Patil; Durga Toshniwal; Kumkum Garg
Affiliations:
Indian Institute of Technology, Roorkee, India; Indian Institute of Technology, Roorkee, India; Manipal Institute of Technology, Manipal, India
Venue:
COMPUTE '11 Proceedings of the Fourth Annual ACM Bangalore Conference
Year:
2011

Abstract

Mining genomic data and extracting knowledge from it is an important problem in bioinformatics. Existing methods for species identification are based on n-grams. In this paper, we propose a novel approach to species identification. Given a database of genomic sequences, the proposed method extracts all candidate subsequences that satisfy three constraints: a length greater than or equal to a given minimum length, a match within a given number of mismatches, and a support greater than or equal to a user-specified threshold. These patterns are used as features for a classifier. Genome sequences are then classified using three data mining techniques: Naive Bayes, support vector machine, and nearest neighbor. Individual classifier accuracies are reported. We also show the effect of sampling size on classification accuracy; accuracy was observed to increase with sampling size. Genome data from two species, E. coli and yeast, are used to verify the proposed method.
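
The abstract does not give the extraction algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that mismatches are counted as Hamming distance, that support is the fraction of database sequences containing a pattern approximately, that candidates have one fixed length (the paper allows any length at or above the minimum), and that each pattern becomes a binary presence feature. The function names are hypothetical, and a plain 1-nearest-neighbor rule stands in for the classifiers named in the abstract.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approx(pattern, seq, max_mismatches):
    """True if some window of seq matches pattern within max_mismatches."""
    m = len(pattern)
    return any(
        hamming(pattern, seq[i:i + m]) <= max_mismatches
        for i in range(len(seq) - m + 1)
    )


def mine_patterns(seqs, length, max_mismatches, min_support):
    """Return fixed-length patterns whose approximate support meets min_support.

    Assumption: support is the fraction of sequences in seqs that contain
    the pattern within max_mismatches. Candidates are the substrings that
    actually occur in the database (a brute-force enumeration).
    """
    candidates = {
        s[i:i + length] for s in seqs for i in range(len(s) - length + 1)
    }
    kept = []
    for p in sorted(candidates):
        hits = sum(occurs_approx(p, s, max_mismatches) for s in seqs)
        if hits / len(seqs) >= min_support:
            kept.append(p)
    return kept


def to_features(seq, patterns, max_mismatches):
    """Binary vector: 1 if the pattern occurs approximately in seq, else 0."""
    return [int(occurs_approx(p, seq, max_mismatches)) for p in patterns]


def nearest_neighbor(train_X, train_y, x):
    """Label of the training vector closest to x (ties go to the first)."""
    dists = [hamming(t, x) for t in train_X]
    return train_y[dists.index(min(dists))]
```

A typical use would call `mine_patterns` on the training sequences, map every sequence to a vector with `to_features`, and pass those vectors to `nearest_neighbor` or to any other classifier. The brute-force scan is quadratic in the database size and is meant for illustration on short sequences only.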