A Space Efficient Solution to the Frequent String Mining Problem for Many Databases

Authors:
Adrian Kügel;Enno Ohlebusch
Affiliations:
Faculty of Engineering and Computer Sciences, University of Ulm, Ulm, D-89069;Faculty of Engineering and Computer Sciences, University of Ulm, Ulm, D-89069
Venue:
ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
Year:
2008

Citing 2
Cited 0

A space efficient solution to the frequent string mining problem for many databases

Data Mining and Knowledge Discovery
Optimal string mining under frequency constraints

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases


Abstract

In the frequent string mining problem, one is given $m$ databases ${\cal D}_1,...,{\cal D}_m$ of strings and searches for strings that fulfill certain frequency constraints. The constraints consist of $m$ pairs of thresholds $(\mathit{minf}_1,\mathit{maxf}_1),...,(\mathit{minf}_m,\mathit{maxf}_m)$, and one wants to find all strings $\phi$ that satisfy $\mathit{minf}_i \le \mathit{freq}(\phi, {\cal D}_i) \le \mathit{maxf}_i$ for all $i$ with $1 \le i \le m$, where $\mathit{freq}(\phi,\mathcal{D}_i) = |\{ \psi \in \mathcal{D}_i : \phi \mbox{ is a substring of } \psi \}|$. Fischer et al. [2] presented an algorithm that solves the frequent string mining problem in linear time under the assumption that the number of databases is treated as a constant. The space consumption of this algorithm, however, is proportional to the total size of all databases. We improve this algorithm in such a way that its space consumption is proportional to the size of the largest database, and it takes linear time regardless of the number of databases. Also, our algorithm is more flexible in the sense that one of several databases can be replaced without having to recalculate everything, that is, intermediate data can be stored on file and be reused.
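
To make the problem statement concrete, the following Python sketch implements the definition of $\mathit{freq}$ and a brute-force check of the threshold constraints. It is only an illustration of the problem, not the linear-time, space-efficient algorithm the abstract describes. The names `freq`, `substrings`, and `mine_naive` are hypothetical. The sketch also assumes that candidates are restricted to non-empty substrings occurring in at least one database, which the abstract does not state.

```python
from typing import List, Set, Tuple


def freq(phi: str, database: List[str]) -> int:
    """Count the strings in `database` that contain `phi` as a substring.

    Each string is counted at most once, however often `phi` occurs in it.
    """
    return sum(1 for psi in database if phi in psi)


def substrings(database: List[str]) -> Set[str]:
    """All non-empty substrings of the strings in `database`."""
    return {
        psi[i:j]
        for psi in database
        for i in range(len(psi))
        for j in range(i + 1, len(psi) + 1)
    }


def mine_naive(
    databases: List[List[str]],
    thresholds: List[Tuple[int, int]],
) -> Set[str]:
    """Brute-force frequent string mining (illustrative only).

    Assumption: candidates are the non-empty substrings that occur in at
    least one database. Returns every candidate phi with
    minf_i <= freq(phi, D_i) <= maxf_i for all i.
    """
    candidates: Set[str] = set()
    for db in databases:
        candidates |= substrings(db)
    return {
        phi
        for phi in candidates
        if all(
            minf <= freq(phi, db) <= maxf
            for db, (minf, maxf) in zip(databases, thresholds)
        )
    }


if __name__ == "__main__":
    d1 = ["abab", "abc"]
    d2 = ["bcd", "xab"]
    # Strings contained in both strings of d1 and in at most one string of d2.
    print(sorted(mine_naive([d1, d2], [(2, 2), (0, 1)])))
```

In this toy input, `"b"` is excluded because it occurs in both strings of `d2` and therefore exceeds $\mathit{maxf}_2 = 1$. The brute-force enumeration is quadratic in the string lengths before any frequency check, which is why index-based algorithms such as those mentioned in the abstract are of interest.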