Discovery of Delta Closed Patterns and Noninduced Patterns from Sequences

  • Authors:
  • Andrew K. C. Wong;Dennis Zhuang;Gary C. L. Li;En-Shiun Annie Lee

  • Affiliations:
  • University of Waterloo, Waterloo;University of Waterloo, Waterloo;University of Waterloo, Waterloo;University of Waterloo, Waterloo

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Discovering patterns from sequence data has significant impact in many aspects of science and society, especially in genomics and proteomics. Here we consider multiple strings as input sequence data and substrings as patterns. In the real world, usually a large set of patterns could be discovered yet many of them are redundant, thus degrading the output quality. This paper improves the output quality by removing two types of redundant patterns. First, the notion of delta tolerance closed itemset is employed to remove redundant patterns that are not delta closed. Second, the concept of statistically induced patterns is proposed to capture redundant patterns which seem to be statistically significant yet their significance is induced by their strong significant subpatterns. It is computationally intense to mine these nonredundant patterns (delta closed patterns and noninduced patterns). To efficiently discover these patterns in very large sequence data, two efficient algorithms have been developed through innovative use of suffix tree. Three sets of experiments were conducted to evaluate their performance. They render excellent results when applying to genomics. The experiments confirm that the proposed algorithms are efficient and that they produce a relatively small set of patterns which reveal interesting information in the sequences.