CONTRAlign: discriminative training for protein sequence alignment

Authors:
Chuong B. Do; Samuel S. Gross; Serafim Batzoglou
Affiliations:
Stanford University, Stanford, CA; Stanford University, Stanford, CA; Stanford University, Stanford, CA
Venue:
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Year:
2006

Abstract

In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
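
To make the pair conditional random field idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a three-state (match, gap-in-x, gap-in-y) global alignment model whose score is a sum of feature weights, and it computes both the best alignment score (Viterbi) and the log partition function (forward algorithm), so that the conditional probability of the best alignment is exp(best - logZ). The weight names, the weight values, and the crude hydrophobic residue set are illustrative assumptions; in a trained model the weights would be learned from example alignments rather than set by hand.

```python
import math

NEG_INF = float("-inf")

# Assumption: a crude hydrophobic grouping, used only to illustrate an extra feature.
HYDROPHOBIC = set("AVLIMFC")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def column_score(a, b, w):
    """Sum of the weights of the features that fire when a is aligned to b."""
    s = w["match"] if a == b else w["mismatch"]
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        s += w["both_hydrophobic"]  # hypothetical hydropathy-style feature
    return s


def pair_dp(x, y, w, combine):
    """Three-state dynamic program over all global alignments of x and y.

    combine=max       -> score of the single best alignment (Viterbi)
    combine=logsumexp -> log of the partition function Z (forward algorithm)
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to y[j-1]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to a gap
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j-1] aligned to a gap
    M[0][0] = 0.0  # empty alignment, treated as the start state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                prev = [M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]]
                M[i][j] = column_score(x[i - 1], y[j - 1], w) + combine(prev)
            if i > 0:
                X[i][j] = combine([
                    M[i - 1][j] + w["gap_open"],
                    X[i - 1][j] + w["gap_extend"],
                    Y[i - 1][j] + w["gap_open"],
                ])
            if j > 0:
                Y[i][j] = combine([
                    M[i][j - 1] + w["gap_open"],
                    Y[i][j - 1] + w["gap_extend"],
                    X[i][j - 1] + w["gap_open"],
                ])
    return combine([M[n][m], X[n][m], Y[n][m]])


# Hand-picked illustrative weights (assumption); training would fit these.
weights = {
    "match": 1.0,
    "mismatch": -1.0,
    "both_hydrophobic": 0.5,
    "gap_open": -2.0,
    "gap_extend": -0.5,
}

best = pair_dp("AC", "AC", weights, max)
log_z = pair_dp("AC", "AC", weights, logsumexp)
prob_best = math.exp(best - log_z)  # conditional probability of the best alignment
```

Because every quantity is a sum of feature weights, adding a new signal (for example, known secondary structure) in this sketch amounts to adding another term to `column_score` and another entry to `weights`; the dynamic program itself does not change.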