We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the size of the input and the output. The additional space is linear in the input size. The framework can be used to mine frequent strings, emerging strings, and strings that pass other statistical tests, e.g., the χ²-test. In contrast to this result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Our approach builds on several recent results on index structures for strings, among them suffix arrays and LCP arrays, together with a new preprocessing scheme for range minimum queries. Compared with dynamic data structures such as trees, array-based data structures offer good locality behavior and extend naturally to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
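To make the index structures named above concrete, the following is a minimal Python sketch of a suffix array, an LCP array, and a range minimum query structure over the LCP array. It is not the framework proposed here: the suffix array is built by naive sorting, the range minimum structure is a standard sparse table using O(n log n) space rather than a linear-space scheme, and all function names are hypothetical, chosen only for illustration.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only,
    # far from the linear-time constructions used in practice.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai-style computation: lcp[k] is the length of the longest common
    # prefix of the suffixes at sa[k-1] and sa[k]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def count_occurrences(text, sa, pattern):
    # The suffixes starting with `pattern` form one contiguous interval of
    # the suffix array; two binary searches find its borders.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left


def build_sparse_table(values):
    # Sparse table (assumption: swapped in for simplicity, O(n log n) space).
    # table[j][i] holds min(values[i : i + 2**j]).
    table = [list(values)]
    j = 1
    while (1 << j) <= len(values):
        prev = table[-1]
        half = 1 << (j - 1)
        width = len(values) - (1 << j) + 1
        table.append([min(prev[i], prev[i + half]) for i in range(width)])
        j += 1
    return table


def range_min(table, lo, hi):
    # Minimum of values[lo:hi]; requires lo < hi.
    k = (hi - lo).bit_length() - 1
    return min(table[k][lo], table[k][hi - (1 << k)])


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
rmq = build_sparse_table(lcp)
```

In this sketch, `count_occurrences(text, sa, "ana")` gives the frequency of a candidate string, and `range_min(rmq, a + 1, b + 1)` gives the longest common prefix of the suffixes at suffix-array positions `a < b`, which is the kind of query a range minimum structure over the LCP array supports.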