A space efficient solution to the frequent string mining problem for many databases

Authors:
Adrian Kügel;Enno Ohlebusch
Affiliations:
Faculty of Engineering and Computer Sciences, University of Ulm, Ulm, Germany 89069;Faculty of Engineering and Computer Sciences, University of Ulm, Ulm, Germany 89069
Venue:
Data Mining and Knowledge Discovery
Year:
2008



Abstract

The frequent string mining problem is to find all substrings of a collection of string databases that satisfy database-specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem so that its peak memory consumption is within a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of that database, which makes it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm that calculates the intersection of the results of different databases. This algorithm is based on an algorithm for merging two suffix arrays, and our modification also calculates the LCP table of the resulting suffix array during the merge.
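
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the authors' algorithm: it enumerates every substring explicitly and therefore needs far more than linear time and space. It assumes that the frequency of a pattern in a database is the number of strings in that database containing the pattern at least once; the abstract does not spell out this definition. The function names and the `(minimum, maximum)` constraint format are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def substring_frequencies(database: List[str]) -> Dict[str, int]:
    """Map each non-empty substring to the number of strings containing it.

    Assumption: "frequency" means document frequency, so a substring that
    occurs several times in one string is counted once for that string.
    """
    freq: Dict[str, int] = {}
    for s in database:
        # Collect the distinct substrings of s so each is counted once per string.
        distinct = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        for sub in distinct:
            freq[sub] = freq.get(sub, 0) + 1
    return freq


def frequent_strings(
    databases: List[List[str]],
    constraints: List[Tuple[int, int]],
) -> List[str]:
    """Return, in lexicographical order, every substring whose frequency lies
    within [minimum, maximum] for each database (brute force, illustration only).
    """
    tables = [substring_frequencies(db) for db in databases]
    # Any substring of any database is a candidate; a substring absent from a
    # database has frequency 0 there, which may still satisfy a minimum of 0.
    candidates = set().union(*tables)
    return sorted(
        p
        for p in candidates
        if all(lo <= t.get(p, 0) <= hi for t, (lo, hi) in zip(tables, constraints))
    )


if __name__ == "__main__":
    d1 = ["abab", "abc"]
    d2 = ["bcd", "xab"]
    # Require frequency exactly 2 in d1 and at most 1 in d2.
    print(frequent_strings([d1, d2], [(2, 2), (0, 1)]))  # ['a', 'ab']
```

In this toy input, "a", "b" and "ab" each occur in both strings of the first database, but "b" also occurs in both strings of the second database and so violates its maximum of 1. The per-database result sets are intersected implicitly by the `all(...)` test; the paper instead obtains this intersection in linear time by merging suffix arrays.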