Comparative n-gram analysis of whole-genome protein sequences

Authors:
M. Ganapathiraju;D. Weisser;R. Rosenfeld;J. Carbonell;R. Reddy;J. Klein-Seetharaman
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Year:
2002

Abstract

A current barrier to successful rational drug design is the lack of understanding of the structure space of the proteins in a cell, which is determined by their sequence space. The protein sequences capable of folding into functional three-dimensional shapes clearly differ between organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of things do people say?" we therefore need to ask the question "What kind of amino acid sequences occur in the proteins of an organism?" An understanding of the sequence space occupied by proteins in different organisms would have important applications for the "translation" of proteins from the language of one organism into that of another, and for the design of drugs that target sequences that might be unique to, or preferred by, pathogenic organisms over those in human hosts. Here we describe the development of a biological language modeling toolkit (BLMT) for genome-wide statistical amino acid n-gram analysis and comparison across organisms (freely accessible at www.cs.cmu.edu/~blmt). Its functions were applied to 44 different genomes: bacterial, archaeal and human. Amino acid n-gram distribution was found to be characteristic of organisms, as evidenced by (1) the ability of simple Markovian unigram models to distinguish organisms, (2) the marked variation in n-gram distributions across organisms above random variation, and (3) the identification of organism-specific phrases in protein sequences that lie more than an order of magnitude in standard deviations away from the mean. These lines of evidence suggest that different organisms utilize different "vocabularies" and "phrases", an observation that may provide novel approaches to drug development by specifically targeting these phrases.
The results suggest that further detailed analysis of n-gram statistics of protein sequences from whole genomes will likely, in analogy to word n-gram analysis, result in powerful models for prediction, topic classification and information extraction of biological sequences.
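
The kind of analysis the abstract describes can be illustrated with a minimal sketch. It counts overlapping amino acid n-grams per organism and scores how far each n-gram's relative frequency in one organism lies from the mean across organisms, in units of standard deviation. This is not the BLMT implementation; the function names, the toy sequences, the organism labels, and the choice of a population standard deviation are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def ngram_counts(sequences, n):
    """Count overlapping n-grams over a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def z_scores(target, genomes, n):
    """Score each n-gram of `target` against its frequency across all organisms.

    `genomes` maps a (hypothetical) organism name to its protein sequences.
    N-grams whose frequency does not vary across organisms are skipped.
    """
    freqs = {
        name: relative_frequencies(ngram_counts(seqs, n))
        for name, seqs in genomes.items()
    }
    scores = {}
    for gram, freq in freqs[target].items():
        values = [f.get(gram, 0.0) for f in freqs.values()]
        sd = pstdev(values)
        if sd > 0:
            scores[gram] = (freq - mean(values)) / sd
    return scores


# Toy data, invented for illustration only (not real proteomes).
toy_genomes = {
    "org_a": ["MKVLAAGG", "MKVL"],
    "org_b": ["MAAGGK"],
    "org_c": ["GGKVL"],
}

if __name__ == "__main__":
    scores = z_scores("org_a", toy_genomes, n=2)
    # List the n-grams most over-represented in org_a first.
    for gram, z in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(gram, round(z, 3))
```

With only a handful of organisms the scores are bounded and small; deviations as large as those reported in the abstract require many genomes and whole-proteome counts.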