Optimal string mining under frequency constraints

Authors:
Johannes Fischer; Volker Heun; Stefan Kramer
Affiliations:
Institut für Informatik, Ludwig-Maximilians-Universität München, München; Institut für Informatik, Ludwig-Maximilians-Universität München, München; Institut für Informatik/I12, Technische Universität München, Garching b. München
Venue:
PKDD'06 Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases
Year:
2006

Abstract

We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach builds on several recent results on index structures for strings, among them suffix and lcp arrays, and on a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
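
To make the role of the index structures named in the abstract more concrete, the following is a minimal Python sketch of how a suffix array and an lcp array can answer frequency queries. It is not the authors' algorithm. As simplifying assumptions, it counts occurrences within a single string rather than the number of database strings containing a pattern, it builds the suffix array by naive sorting instead of a linear-time construction, and it replaces the constant-time range minimum query structure with a plain min() scan. All function names are illustrative.

```python
def suffix_array(text):
    # Naive construction by sorting all suffixes (demo only, not linear time).
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes at ranks r-1 and r; lcp[0] is 0 by convention.
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def count_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` form one contiguous interval of
    # the suffix array; two binary searches find its boundaries.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


def frequent_substrings(text, min_count):
    # Every substring with at least `min_count` occurrences is a prefix of
    # the common prefix of some `min_count` consecutive suffixes in the
    # suffix array. That common prefix length is a range minimum over lcp.
    if min_count < 2:
        raise ValueError("this sketch assumes min_count >= 2")
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    found = set()
    for r in range(len(sa) - min_count + 1):
        # Plain scan standing in for a constant-time RMQ structure.
        common = min(lcp[r + 1:r + min_count])
        for length in range(1, common + 1):
            found.add(text[sa[r]:sa[r] + length])
    return found
```

For example, on the string "banana" the suffix array is [5, 3, 1, 0, 4, 2], and frequent_substrings("banana", 2) yields the five substrings that occur at least twice. An emerging-string query in the sense of the abstract could, under the same assumptions, combine a minimum count in one string with a maximum count in another.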