Computers in Biology and Medicine
A current barrier to successful rational drug design is the lack of understanding of the structure space provided by the proteins in a cell, which is determined by their sequence space. The protein sequences capable of folding into functional three-dimensional shapes clearly differ between organisms, since sequences obtained from human proteins often fail to form correct three-dimensional structures in bacterial organisms. In analogy to the question "What kind of things do people say?" we therefore need to ask the question "What kind of amino acid sequences occur in the proteins of an organism?" An understanding of the sequence space occupied by proteins in different organisms would have important applications for the "translation" of proteins from the language of one organism into that of another, and for the design of drugs that target sequences that might be unique to, or preferred by, pathogenic organisms over those in human hosts. Here we describe the development of a biological language modeling toolkit (BLMT) for genome-wide statistical amino acid n-gram analysis and comparison across organisms (freely accessible at www.cs.cmu.edu/~blmt). Its functions were applied to 44 different bacterial and archaeal genomes and the human genome. Amino acid n-gram distribution was found to be characteristic of organisms, as evidenced by (1) the ability of simple Markovian unigram models to distinguish organisms, (2) the marked variation in n-gram distributions across organisms above random variation, and (3) the identification of organism-specific phrases in protein sequences that are more than an order of magnitude in standard deviations away from the mean. These lines of evidence suggest that different organisms utilize different "vocabularies" and "phrases", an observation that may provide novel approaches to drug development by specifically targeting these phrases.
The results suggest that further detailed analysis of n-gram statistics of protein sequences from whole genomes will likely, in analogy to word n-gram analysis, result in powerful models for prediction, topic classification and information extraction of biological sequences.
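
The three kinds of evidence listed in the abstract can be illustrated with a minimal Python sketch. This is not the BLMT implementation: the function names, the add-one smoothing, the toy sequences, and the choice to compute the z-score against the other organisms' relative frequencies are all assumptions made for illustration only.

```python
from collections import Counter
from math import log, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_counts(sequences, n):
    """Count overlapping amino acid n-grams over a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def unigram_model(sequences, alpha=1.0):
    """Add-alpha smoothed unigram probabilities (smoothing is an assumption)."""
    counts = ngram_counts(sequences, 1)
    total = sum(counts[a] for a in AMINO_ACIDS) + alpha * len(AMINO_ACIDS)
    return {a: (counts[a] + alpha) / total for a in AMINO_ACIDS}


def log_likelihood(seq, model):
    """Log-probability of a sequence under a unigram model.

    Residues outside the 20 standard amino acids are skipped.
    """
    return sum(log(model[a]) for a in seq if a in model)


def classify(seq, models):
    """Return the name of the organism whose unigram model scores seq highest."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))


def zscore(ngram, target, counts_by_organism):
    """Z-score of an n-gram's relative frequency in `target`, measured against
    the mean and standard deviation of the other organisms (hypothetical
    definition; the paper's exact statistic may differ).

    Returns None when the other organisms show no variation.
    """
    freqs = {}
    for name, counts in counts_by_organism.items():
        total = sum(counts.values())
        freqs[name] = counts[ngram] / total if total else 0.0
    others = [f for name, f in freqs.items() if name != target]
    mean = sum(others) / len(others)
    sd = sqrt(sum((f - mean) ** 2 for f in others) / len(others))
    return (freqs[target] - mean) / sd if sd > 0 else None


# Toy "proteomes"; real input would be whole-genome protein sequences.
proteomes = {
    "org_a": ["AAAAKKAA", "AAKAAA"],
    "org_b": ["WWWYWW", "YWWWYY"],
}
models = {name: unigram_model(seqs) for name, seqs in proteomes.items()}
predicted = classify("AKAA", models)
```

With real data, `proteomes` would hold every protein sequence of each genome, and `zscore` would be evaluated over all observed n-grams to rank candidate organism-specific "phrases".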