Space Efficient String Mining under Frequency Constraints

  • Authors: Johannes Fischer; Veli Mäkinen; Niki Välimäki
  • Affiliations: -; -; -
  • Venue: ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
  • Year: 2008

Abstract

Let $\mathcal{D}_1$ and $\mathcal{D}_2$ be two databases (i.e., multisets) of $d$ strings over an alphabet $\Sigma$, with overall length $n$. We study the problem of mining discriminative patterns between $\mathcal{D}_1$ and $\mathcal{D}_2$, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is $O(n \log n)$ bits, which is not optimal for $|\Sigma