Discovering frequent factors from long strings is an important problem in many applications, such as biosequence mining. Classical approaches process a vast database of short strings. In this paper, by contrast, we analyze a small database of long strings. The main difference lies in the much larger number of patterns to analyze. To tackle the problem, we have developed a new algorithm, SANSPOS, for discovering frequent factors in long strings. It is an Apriori-like solution that exploits the fact that no super-pattern of a non-frequent pattern can be frequent. SANSPOS follows a multiple-pass, candidate generation-and-test approach, and patterns of several lengths can be generated in a single pass. The algorithm uses a new data structure to arrange nodes in a trie, and defines a Positioning Matrix as a new positioning strategy. Positioning Matrices let us apply advanced pruning heuristics in a trie at minimal computational cost, process strings that include Short Tandem Repeats, and calculate different interestingness measures efficiently. Furthermore, the algorithm applies parallelism to traverse different sections of the input strings concurrently, reducing the running time. The algorithm has been used successfully in natural language and biological sequence contexts.
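The Apriori property the abstract relies on can be illustrated with a minimal Python sketch. This is not the SANSPOS algorithm: it uses no trie and no Positioning Matrix, it generates a single pattern length per pass, and it runs serially. The function name `frequent_factors`, its parameters, and the choice to define support as the total number of (possibly overlapping) occurrences across all strings are assumptions made only for illustration.

```python
from collections import Counter


def frequent_factors(strings, min_support, max_len):
    """Level-wise (Apriori-like) search for frequent factors (substrings).

    Assumption for this sketch: the support of a factor is its total number
    of occurrences, overlaps included, summed over all input strings.
    """
    # Pass 1: count single symbols and keep the frequent ones.
    counts = Counter(ch for s in strings for ch in s)
    current = {f: c for f, c in counts.items() if c >= min_support}
    frequent = dict(current)

    # Pass k: count only candidates of length k that survive pruning.
    for length in range(2, max_len + 1):
        if not current:
            break  # no frequent factor of length k-1, so none of length k
        counts = Counter()
        for s in strings:
            for i in range(len(s) - length + 1):
                cand = s[i:i + length]
                # Apriori pruning: a factor can be frequent only if both its
                # prefix and its suffix of length k-1 are frequent.
                if cand[:-1] in current and cand[1:] in current:
                    counts[cand] += 1
        current = {f: c for f, c in counts.items() if c >= min_support}
        frequent.update(current)
    return frequent


# Hypothetical toy input: two short DNA-like strings.
result = frequent_factors(["ACACAC", "CACAGT"], min_support=3, max_len=3)
```

The pruning test is where the savings come from: any candidate containing an infrequent shorter factor is never counted. On long strings the number of surviving candidates can still be very large, which is the situation the abstract says the trie layout and Positioning Matrices are designed to handle.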