Mismatch string kernels for discriminative protein classification

Authors:
Christina S. Leslie;Eleazar Eskin;Adiel Cohen;Jason Weston;William Stafford Noble
Affiliations:
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, Mail Code 0401, New York, NY 10027, USA;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, Mail Code 0401, New York, NY 10027, USA;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, Mail Code 0401, New York, NY 10027, USA;Max-Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany;Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, Seattle, WA 98195, USA
Venue:
Bioinformatics
Year:
2004

Abstract

Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns.

Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, which lets us calculate the contributions of all patterns occurring in the data in a single traversal of the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.

Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.
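
The kernel computation summarized above can be illustrated with a minimal Python sketch. It is not the authors' software: the function name `mismatch_kernel_matrix`, the parameters `k` (pattern length) and `m` (maximum number of mismatches), and the instance-tuple layout are assumptions made for illustration. The sketch walks an implicit tree of all length-`k` patterns depth first, keeps only the substrings of the input that are still within `m` mismatches of the current prefix, and updates every kernel entry when it reaches a leaf.

```python
from collections import Counter


def mismatch_kernel_matrix(seqs, k, m, alphabet):
    """Unnormalized Gram matrix of a (k, m)-mismatch kernel (illustrative sketch).

    Entry [i][j] sums, over every length-k pattern, the product of the number
    of k-mers in seqs[i] and in seqs[j] lying within m mismatches of it.
    """
    n = len(seqs)
    K = [[0] * n for _ in range(n)]

    # A live instance is (sequence index, start offset, mismatches so far).
    root = [(i, p, 0)
            for i, s in enumerate(seqs)
            for p in range(len(s) - k + 1)]

    def descend(instances, depth):
        if not instances:
            return  # no k-mer in the data matches this prefix: prune the subtree
        if depth == k:
            # Leaf: a complete pattern. Count surviving k-mers per sequence
            # and add their pairwise products to the kernel matrix.
            counts = Counter(i for i, _, _ in instances)
            for i, ci in counts.items():
                for j, cj in counts.items():
                    K[i][j] += ci * cj
            return
        for letter in alphabet:
            survivors = []
            for i, p, mis in instances:
                mis_next = mis + (seqs[i][p + depth] != letter)
                if mis_next <= m:
                    survivors.append((i, p, mis_next))
            descend(survivors, depth + 1)

    descend(root, 0)
    return K


if __name__ == "__main__":
    # Toy two-letter alphabet; real protein data would use the 20 amino acids.
    print(mismatch_kernel_matrix(["AB", "BA"], k=2, m=1, alphabet="AB"))
    # prints [[3, 2], [2, 3]]
```

In the toy run, the patterns `AA` and `BB` are each within one mismatch of both `AB` and `BA`, which gives the off-diagonal value 2. The resulting matrix could be passed to any SVM implementation that accepts a precomputed kernel; normalization and the choice of `k` and `m` are left open here.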