Monotony of surprise and large-scale quest for unusual words

Authors:
Alberto Apostolico;Mary Ellen Bock;Stefano Lonardi
Affiliations:
Università di Padova, Padova, Italy and Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN;University of California, Riverside, CA
Venue:
Proceedings of the sixth annual international conference on Computational biology
Year:
2002

References (4):
Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM)
Pattern matching algorithms
Annotated Statistical Indices for Sequence Analysis. SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Global detectors of unusual words: design, implementation, and applications to pattern discovery in biosequences

Cited by (14):
Finding surprising patterns in a time series database in linear time and space. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A symbolic representation of time series, with implications for streaming algorithms. DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Probabilistic discovery of time series motifs. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology - Special issue on bioinformatics
Visually mining and monitoring massive time series. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Visualizing and discovering non-trivial patterns in large time series databases. Information Visualization
Motif discovery by monotone scores. Discrete Applied Mathematics
Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery
Maximal and minimal representations of gapped and non-gapped motifs of a string. Theoretical Computer Science
Efficient selection of unique and popular oligos for large EST databases. CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
Approximate variable-length time series motif discovery using grammar inference. Proceedings of the Tenth International Workshop on Multimedia Data Mining
Palmprint authentication using time series. AVBPA'05 Proceedings of the 5th international conference on Audio- and Video-Based Biometric Person Authentication
A clustering algorithm based on distinguishability for nominal attributes. ICAISC'12 Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II
Faster variance computation for patterns with gaps. MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms

Abstract

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs, and related associations or rules, is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery of such patterns, particularly on a massive scale, poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
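
To make the idea of a score that is monotone under word extension concrete, here is a minimal sketch under assumptions that are not taken from the paper: symbols are treated as i.i.d. with probabilities estimated from the text itself, the score is simply the observed count minus the expected count, and counting is done by brute force rather than with the linear-size data structures the abstract refers to. All function names are hypothetical. The sketch illustrates one simple fact: if extending a word w by one symbol leaves its count unchanged, the expectation can only shrink, so this particular score cannot decrease.

```python
from collections import Counter
from math import prod


def symbol_probs(text):
    """Estimate i.i.d. symbol probabilities from the text (an assumed model)."""
    counts = Counter(text)
    return {c: k / len(text) for c, k in counts.items()}


def count(text, word):
    """Number of possibly overlapping occurrences of word in text (brute force)."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)


def expected(text, word, probs):
    """Expected count of word under the i.i.d. model."""
    positions = len(text) - len(word) + 1
    return positions * prod(probs[c] for c in word)


def score(text, word, probs):
    """Illustrative score: observed count minus expected count."""
    return count(text, word) - expected(text, word, probs)


def same_count_extensions(text, max_len):
    """Yield pairs (w, wa) where wa extends w by one symbol with equal count."""
    seen = set()
    for m in range(1, max_len):
        for i in range(len(text) - m):
            w, wa = text[i:i + m], text[i:i + m + 1]
            if (w, wa) not in seen and count(text, w) == count(text, wa):
                seen.add((w, wa))
                yield w, wa


if __name__ == "__main__":
    text = "ACGTACGTTACGAACGT"
    probs = symbol_probs(text)
    for w, wa in same_count_extensions(text, 4):
        # Same count, smaller expectation: the score does not decrease.
        print(w, wa, score(text, w, probs) <= score(text, wa, probs))
```

Under these assumptions, among words that share the same set of occurrences only the longest needs to be scored and reported as a candidate over-represented word, which is the kind of saving that keeps the output from outgrowing the input sequence.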