Annotated stochastic context free grammars for analysis and synthesis of proteins

Authors:
Eva Sciacca;Salvatore Spinella;Dino Ienco;Paola Giannini
Affiliations:
Dipartimento di Informatica, Università di Torino, Torino, Italy;Dipartimento di Informatica, Università di Torino, Torino, Italy;Dipartimento di Informatica, Università di Torino, Torino, Italy;Dipartimento di Informatica, Università di Torino, Torino, Italy and Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria, Italy
Venue:
EvoBIO'11 Proceedings of the 9th European conference on Evolutionary computation, machine learning and data mining in bioinformatics
Year:
2011

Abstract

An important step toward understanding the main functions of a specific protein family is detecting the features that reveal how its protein chains are constituted. To this end, we treated the amino acid sequences of proteins as a formal language and built a Context-Free Grammar annotated using an n-gram Bayesian classifier. This formalism makes it possible to analyze the connection between protein chains and protein functions. To design new protein chains with the properties of the considered family, we clustered the rules of the grammar to build an Annotated Stochastic Context Free Grammar. We applied our methodology to a class of Antimicrobial Peptides (AmPs): the frog antimicrobial peptides family. In this case study, our approach highlighted important aspects of the relationship between sequences and functional domains of proteins, and of how protein domain motifs are preserved in the amino acid sequences by natural evolution. Moreover, our results suggest that the synthesis of new proteins with a given domain architecture is one field where Annotated Stochastic Context Free Grammars can be useful.
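
To make the idea of synthesizing sequences from a stochastic context-free grammar more concrete, the following is a minimal sketch in Python. It is not the grammar or the procedure from the paper: the nonterminal names (PEPTIDE, HEAD, CORE, TAIL), the rules, and the probabilities are all invented for illustration, and the annotation and rule-clustering steps are not modeled. It only shows how a sequence over amino-acid letters can be sampled by expanding rules according to their probabilities.

```python
import random

# Hypothetical toy grammar (an assumption, not taken from the paper).
# Each nonterminal maps to a list of (probability, right-hand side) pairs.
# Symbols that are not keys of the dictionary are terminals, here
# one-letter amino-acid codes.
TOY_SCFG = {
    "PEPTIDE": [(1.0, ["HEAD", "CORE", "TAIL"])],
    "HEAD": [(0.6, ["G", "L"]), (0.4, ["F", "L"])],
    "CORE": [(0.5, ["K", "CORE"]), (0.3, ["A", "CORE"]), (0.2, ["K"])],
    "TAIL": [(0.7, ["C"]), (0.3, ["G", "C"])],
}


def check_probabilities(grammar, tol=1e-9):
    """Return True if the rule probabilities of every nonterminal sum to 1."""
    return all(
        abs(sum(p for p, _ in rules) - 1.0) <= tol
        for rules in grammar.values()
    )


def sample(symbol, grammar, rng):
    """Expand `symbol` recursively, choosing each rule by its probability."""
    if symbol not in grammar:
        # Terminal symbol: emit it unchanged.
        return symbol
    rules = grammar[symbol]
    weights = [p for p, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    return "".join(sample(s, grammar, rng) for s in rhs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    assert check_probabilities(TOY_SCFG)
    for _ in range(3):
        print(sample("PEPTIDE", TOY_SCFG, rng))
```

In this toy grammar, every sampled string starts with GL or FL, contains a run of K and A residues, and ends in C, which loosely mirrors the idea of generating sequences that preserve a fixed arrangement of motifs.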