We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequences without needlessly restrictive priors. Because the MDL model assigns each phrase a cost in bits, it can be used to predict regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in two ways: in execution time, through an innovative data structure, and in specificity of motif detection (compression), through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.