Verbumculus and the discovery of unusual words

Authors:
Alberto Apostolico;Fang-Cheng Gong;Stefano Lonardi
Affiliations:
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Padova, Italy and Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN;Celera Genomics, 45 W. Gude Drive, Rockville, MD;Department of Computer Science and Engineering, University of California, Riverside, CA
Venue:
Journal of Computer Science and Technology - Special issue on bioinformatics
Year:
2004

Abstract

Measures relating word frequencies to their expectations have long been of interest in bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant computational burden even when limited to words of bounded maximum length. In addition, displaying the huge tables that may result from these counts poses practical problems of visualization and inference.

VERBUMCULUS is a suite of software tools for the fast and efficient detection of over- or under-represented words in nucleotide sequences. The inner core of VERBUMCULUS rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words that make it possible to limit drastically, and a priori, the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible both to detect and to visualize such words in a fast and practically useful way. This paper first describes the facility and then reports experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes.

The VERBUMCULUS software is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/ or http://wwwdbl.dei.unipd.it/Verbumculus/
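
To make the notion of over- or under-represented words concrete, the following is a minimal sketch of a score that compares a word's observed count with its expected count. It is not the VERBUMCULUS implementation: the function name `word_scores`, the i.i.d. background model estimated from the sequence itself, and the binomial variance approximation (which ignores word self-overlap) are all assumptions made for illustration. It also enumerates every word of a single fixed length by brute force, which is exactly the kind of exhaustive approach the abstract says becomes burdensome at scale.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_scores(seq, k):
    """Return {word: (observed, expected, z)} for every length-k word.

    Hypothetical helper for illustration only. Assumes an i.i.d.
    background whose symbol probabilities are estimated from seq.
    """
    n = len(seq)
    positions = n - k + 1  # number of length-k windows in seq
    if k <= 0 or positions <= 0:
        return {}

    # Empirical symbol probabilities (assumed background model).
    symbol_freq = {s: c / n for s, c in Counter(seq).items()}

    # Observed counts of every length-k window, overlaps included.
    observed = Counter(seq[i:i + k] for i in range(positions))

    scores = {}
    # Enumerate all words over the observed alphabet, so that words
    # that never occur (count 0) are scored too. Cost grows as |alphabet|**k.
    for letters in product(sorted(symbol_freq), repeat=k):
        word = "".join(letters)
        p = 1.0
        for s in word:
            p *= symbol_freq[s]
        expected = positions * p
        # Binomial approximation; ignores correlations from self-overlap.
        variance = positions * p * (1.0 - p)
        count = observed.get(word, 0)
        z = (count - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (count, expected, z)
    return scores


if __name__ == "__main__":
    scores = word_scores("ACGTACGTAAAAAA", 2)
    # List words from most over-represented to most under-represented.
    for word, (obs, exp, z) in sorted(scores.items(), key=lambda kv: -kv[1][2]):
        print(f"{word}  observed={obs}  expected={exp:.2f}  z={z:+.2f}")
```

A positive score marks a word seen more often than the background model predicts, and a negative score marks one seen less often. A tool that must handle words of all lengths would need a more careful variance and a way to avoid scoring every candidate, which this sketch does not attempt.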