A note on the height of suffix trees
SIAM Journal on Computing
Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing
Autocorrelation on words and its applications: analysis of suffix trees by string-ruler approach
Journal of Combinatorial Theory Series A
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Color Set Size Problem with Application to String Matching
CPM '92 Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
Replacing suffix trees with enhanced suffix arrays
Journal of Discrete Algorithms - SPIRE 2002
Linear work suffix array construction
Journal of the ACM (JACM)
A space efficient solution to the frequent string mining problem for many databases
Data Mining and Knowledge Discovery
Space Efficient String Mining under Frequency Constraints
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Permuted Longest-Common-Prefix Array
CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
Optimal string mining under frequency constraints
PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
Practical Efficient String Mining
IEEE Transactions on Knowledge and Data Engineering
Hi-index | 0.00 |
The goal of frequency constrained string mining is to extract substrings that discriminate two (or more) datasets. Known solutions to the problem range from an optimal time algorithm to different time---space tradeoffs. However, all of the existing algorithms have been designed to be run in a sequential manner and require that the whole input fits the main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that has a novel time---space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g. Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.