Searching for smallest grammars on large sequences and application to DNA

Authors:
Rafael Carrascosa;François Coste;Matthias Gallé;Gabriel Infante-Lopez
Affiliations:
Grupo de Procesamiento de Lenguaje Natural, Universidad Nacional de Córdoba, Argentina;Symbiose Project, IRISA/INRIA Rennes-Bretagne Atlantique, France;Symbiose Project, IRISA/INRIA Rennes-Bretagne Atlantique, France;Grupo de Procesamiento de Lenguaje Natural, Universidad Nacional de Córdoba, Argentina and Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina
Venue:
Journal of Discrete Algorithms
Year:
2012

Abstract

Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we reduce the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10% smaller) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms.
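
The two sub-problems named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a brute-force illustration, assuming the standard definition of a maximal repeat (a substring that occurs at least twice and cannot be extended to the left or right without losing occurrences) and a deliberately simplified notion of minimal parsing (segmenting only the input sequence into the fewest symbols, where a symbol is either a single character or one of the chosen constituents). The function names `maximal_repeats` and `minimal_parse` and the toy sequence are hypothetical, and the quadratic enumeration of substrings is only practical for short inputs.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=2):
    """Brute-force maximal repeats of s (illustration only).

    A substring is kept if it occurs at least twice (overlaps allowed)
    and its occurrences show at least two distinct left contexts and
    at least two distinct right contexts. The sequence boundary is
    treated as its own context (None).
    """
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue
        left = {s[i - 1] if i > 0 else None for i in starts}
        right = {s[i + len(w)] if i + len(w) < n else None for i in starts}
        if len(left) > 1 and len(right) > 1:
            result[w] = starts
    return result


def minimal_parse(s, constituents):
    """Segment s into the fewest pieces, each a single character or a constituent.

    Simplified dynamic program: best[i] is the minimal number of pieces
    needed to cover the prefix s[:i].
    """
    n = len(s)
    best = [0] * (n + 1)
    back = [None] * (n + 1)  # last piece used in an optimal cover of s[:i]
    for i in range(1, n + 1):
        best[i] = best[i - 1] + 1
        back[i] = s[i - 1]
        for w in constituents:
            k = len(w)
            if 0 < k <= i and s[i - k:i] == w and best[i - k] + 1 < best[i]:
                best[i] = best[i - k] + 1
                back[i] = w

    pieces = []
    i = n
    while i > 0:
        pieces.append(back[i])
        i -= len(back[i])
    return pieces[::-1]


if __name__ == "__main__":
    seq = "TACGATACGC"  # toy sequence, not taken from the paper
    repeats = maximal_repeats(seq)
    print(repeats)
    print(minimal_parse(seq, list(repeats)))
```

On the toy sequence, the only maximal repeat of length at least 2 is `TACG`, and the parse uses it twice. A practical implementation for whole genomes would presumably enumerate maximal repeats with a suffix-tree or suffix-array structure rather than by listing all substrings, and would also parse the right-hand sides of the constituents themselves, which this sketch omits.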