In this paper, we present CONTRAlign, an extensible and fully automatic framework for parameter learning and protein pairwise sequence alignment using pair conditional random fields. When learning a substitution matrix and gap penalties from as few as 20 example alignments, CONTRAlign achieves alignment accuracies competitive with available modern tools. As confirmed by rigorous cross-validated testing, CONTRAlign effectively leverages weak biological signals in sequence alignment: using CONTRAlign, we find that hydropathy-based features result in improvements of 5-6% in aligner accuracy for sequences with less than 20% identity, a signal that state-of-the-art hand-tuned aligners are unable to exploit effectively. Furthermore, when known secondary structure and solvent accessibility are available, such external information is naturally incorporated as additional features within the CONTRAlign framework, yielding additional improvements of up to 15-16% in alignment accuracy for low-identity sequences.
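As a rough illustration of how additional signals can enter an aligner as weighted features, the following minimal sketch scores each aligned residue pair as a weighted sum of features (identity, mismatch, and a crude hydropathy agreement flag) and finds the highest-scoring global alignment by dynamic programming. It is not the CONTRAlign model: it performs only best-path decoding with a single linear gap weight, and the feature set, the `HYDROPHOBIC` residue grouping, the weight values, and all function names are assumptions chosen for illustration rather than learned parameters.

```python
# Toy feature-based pairwise aligner (illustrative sketch only).
# The features, residue grouping, and weights below are assumptions,
# not CONTRAlign's learned parameters.

HYDROPHOBIC = set("AVILMFWC")  # crude, hypothetical hydropathy grouping


def pair_features(a, b):
    """Feature values for aligning residue a with residue b."""
    return {
        "identical": 1.0 if a == b else 0.0,
        "mismatch": 1.0 if a != b else 0.0,
        "same_hydropathy": 1.0 if (a in HYDROPHOBIC) == (b in HYDROPHOBIC) else 0.0,
    }


def pair_score(a, b, w):
    """Weighted sum of the pair features under weight dict w."""
    return sum(w[name] * value for name, value in pair_features(a, b).items())


def best_alignment(x, y, w):
    """Return (score, aligned_x, aligned_y) for the highest-scoring global alignment."""
    n, m = len(x), len(y)
    gap = w["gap"]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1], w), "diag"),
                (score[i - 1][j] + gap, "up"),    # residue of x against a gap
                (score[i][j - 1] + gap, "left"),  # residue of y against a gap
            ]
            # On ties, the first candidate (diag) is kept.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


# Hypothetical weights; a learning framework would fit these from example alignments.
WEIGHTS = {"identical": 2.0, "mismatch": -1.0, "same_hydropathy": 0.5, "gap": -2.0}
```

Calling `best_alignment("AVLK", "AVK", WEIGHTS)` returns the total score together with two gapped strings of equal length. Adding another signal, such as a secondary-structure agreement flag, would only require a new entry in `pair_features` and a matching weight, which is the sense in which a feature-based formulation accommodates external information.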