Discovering Frequent Poly-Regions in DNA Sequences

Authors:
Panagiotis Papapetrou;Gary Benson;George Kollios
Affiliations:
Boston University;Boston University;Boston University
Venue:
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Year:
2006

Cited by (4):
Mining frequent arrangements of temporal intervals. Knowledge and Information Systems
ARTEMIS: assessing the similarity of event-interval sequences. ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Distance measure for querying sequences of temporal intervals. Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments
Size matters: finding the most informative set of window lengths. ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II


Abstract

This paper studies the problem of discovering arrangements of regions in a sequence where one or more items of a given alphabet occur with high frequency, and proposes two efficient algorithms. The first is entropy-based and uses an existing recursive segmentation technique to split the input sequence into a set of homogeneous segments. The second uses a set of sliding windows over the sequence. Each sliding window maintains statistics for a sequence segment, chiefly the number of occurrences of each item in that segment. Combining these statistics efficiently yields the complete set of regions of high occurrence for the items of the given alphabet. Once these regions are identified, the sequence is converted into a sequence of labeled intervals, each corresponding to a region. An efficient algorithm for mining frequent arrangements of event intervals is then applied to the converted sequence to discover frequently occurring arrangements of these regions. The proposed algorithms are tested on various DNA sequences and produce results with significant biological meaning.
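
To make the sliding-window idea and the conversion to labeled intervals concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes a single fixed window length rather than a set of windows, and it treats a window as "dense" for an item when that item fills at least a chosen fraction of the window. The function name `poly_regions` and the parameters `window` and `min_frac` are hypothetical and were chosen only for illustration.

```python
from collections import Counter


def poly_regions(seq, window, min_frac):
    """Return labeled intervals (item, start, end) with end exclusive.

    Illustrative assumption: a region for an item is the union of
    consecutive windows in which the item accounts for at least
    min_frac of the window's positions.
    """
    n = len(seq)
    if window <= 0 or n < window:
        return []

    alphabet = sorted(set(seq))
    threshold = min_frac * window
    counts = Counter(seq[:window])   # per-item counts for the current window
    open_start = {}                  # item -> start of its current dense run
    intervals = []

    for start in range(n - window + 1):
        if start > 0:
            # Slide by one position: drop the leftmost item, add the new one.
            counts[seq[start - 1]] -= 1
            counts[seq[start + window - 1]] += 1
        for item in alphabet:
            dense = counts[item] >= threshold
            if dense and item not in open_start:
                open_start[item] = start
            elif not dense and item in open_start:
                # The last dense window began at start - 1.
                intervals.append((item, open_start.pop(item), start - 1 + window))

    # Close runs that are still dense at the end of the sequence.
    for item, run_start in open_start.items():
        intervals.append((item, run_start, n))

    return sorted(intervals, key=lambda t: (t[1], t[2], t[0]))


if __name__ == "__main__":
    for item, start, end in poly_regions("AAAATTTT", window=4, min_frac=0.75):
        print(item, start, end)
```

Under these assumptions, the sequence `AAAATTTT` with a window of 4 and a fraction of 0.75 yields an `A` interval over positions 0 to 5 and a `T` interval over positions 3 to 8 (end exclusive). The two intervals overlap, which is the kind of relation between labeled intervals that an arrangement-mining step would then take as input.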