Verbumculus and the discovery of unusual words

  • Authors:
  • Alberto Apostolico; Fang-Cheng Gong; Stefano Lonardi

  • Affiliations:
  • Dipartimento di Ingegneria dell'Informazione, Università di Padova, Padova, Italy and Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN; Celera Genomics, 45 W. Gude Drive, Rockville, MD; Department of Computer Science and Engineering, University of California, Riverside, CA

  • Venue:
  • Journal of Computer Science and Technology - Special issue on bioinformatics
  • Year:
  • 2004

Abstract

Measures relating word frequencies to their expectations have long been of interest in bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures has become conceivable, yet it poses a significant computational burden even when limited to words of bounded maximum length. In addition, displaying the huge tables that may result from these counts poses practical problems of visualization and inference.

VERBUMCULUS is a suite of software tools for the efficient detection of over- or under-represented words in nucleotide sequences. The inner core of VERBUMCULUS rests on subtly interwoven properties of statistics, pattern matching, and combinatorics on words that make it possible to limit drastically, and a priori, the set of over- or under-represented candidate words of all lengths in a given sequence, thereby making it more feasible both to detect and to visualize such words in a fast and practically useful way.

This paper first describes the facility and then reports experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements in the upstream regions of a set of yeast genes. The VERBUMCULUS software is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/ or http://wwwdbl.dei.unipd.it/Verbumculus/
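To make the notion of an over- or under-represented word concrete, the following is a minimal sketch of the naive exhaustive approach that the abstract contrasts with VERBUMCULUS. It is not the VERBUMCULUS algorithm: it enumerates every word up to a bounded length, and it assumes an i.i.d. symbol model with a simple binomial variance that ignores word self-overlap. The function name `word_scores` and the parameter `max_len` are hypothetical, introduced only for illustration.

```python
from collections import Counter
from math import sqrt


def word_scores(seq, max_len=3):
    """Return {word: (observed, expected, z)} for words of length 1..max_len.

    Assumption: symbols are i.i.d. with probabilities estimated from seq.
    This is a naive illustration, not the method used by VERBUMCULUS.
    """
    n = len(seq)
    # Symbol probabilities estimated from the sequence itself.
    freq = {c: k / n for c, k in Counter(seq).items()}
    scores = {}
    for m in range(1, max_len + 1):
        positions = n - m + 1  # number of length-m windows in seq
        if positions <= 0:
            break
        counts = Counter(seq[i:i + m] for i in range(positions))
        for word, observed in counts.items():
            # Probability of the word under the i.i.d. model.
            p = 1.0
            for c in word:
                p *= freq[c]
            expected = positions * p
            # Binomial approximation of the variance (ignores self-overlap).
            var = positions * p * (1.0 - p)
            z = (observed - expected) / sqrt(var) if var > 0 else 0.0
            scores[word] = (observed, expected, z)
    return scores


if __name__ == "__main__":
    result = word_scores("ACGTAAAAACGTTTTT", max_len=3)
    # List the words with the largest absolute z-score first.
    for word, (obs, exp, z) in sorted(
        result.items(), key=lambda kv: -abs(kv[1][2])
    )[:5]:
        print(f"{word}\tobserved={obs}\texpected={exp:.2f}\tz={z:.2f}")
```

The table produced this way grows quickly with the maximum word length, which is the computational and visualization burden the abstract refers to.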