String analysis by sliding positioning strategy

Authors:
Manuel Baena-García;José M. Carmona-Cejudo;Rafael Morales-Bueno
Affiliations:
Dpto. Informática, Clínica Rincón Bejar, 29740, Torre del Mar, Málaga, Spain and Dpto. Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071, Mála ...;Dpto. Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071, Málaga, Spain;Dpto. Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071, Málaga, Spain
Venue:
Journal of Computer and System Sciences
Year:
2014

Abstract

Discovering frequent factors in long strings is an important problem in many applications, such as biosequence mining. Classical approaches process a vast database of small strings. In this paper, by contrast, we analyze a small database of long strings. The main difference lies in the much larger number of patterns to analyze. To tackle the problem, we have developed a new algorithm for discovering frequent factors in long strings. We present an Apriori-like solution which exploits the fact that no super-pattern of a non-frequent pattern can be frequent. The SANSPOS algorithm follows a multiple-pass, candidate generation-and-test approach, and patterns of multiple lengths can be generated in a single pass. The algorithm uses a new data structure to arrange nodes in a trie. A Positioning Matrix is defined as a new positioning strategy. By using Positioning Matrices, we can apply advanced pruning heuristics in a trie at minimal computational cost. The Positioning Matrices also let us process strings that include Short Tandem Repeats and calculate different interestingness measures efficiently. Furthermore, our algorithm applies parallelism to traverse different sections of the input strings concurrently, reducing the running time. The algorithm has been used successfully in natural language and biological sequence contexts.
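
The Apriori property mentioned in the abstract can be illustrated with a minimal sketch. This is not SANSPOS: it uses plain dictionaries rather than a trie or Positioning Matrices, generates one pattern length per pass, and runs sequentially. The function name `frequent_factors`, its parameters, and the choice to measure support as the number of (possibly overlapping) occurrences in a single string are assumptions made for illustration only.

```python
from collections import Counter


def frequent_factors(text, min_support, max_len):
    """Level-wise (Apriori-style) search for frequent factors of `text`.

    Hypothetical helper, not the paper's algorithm. Support is assumed
    to be the number of possibly overlapping occurrences in `text`.
    """
    frequent = {}
    prev = None  # frequent factors found in the previous pass
    for length in range(1, max_len + 1):
        counts = Counter()
        for i in range(len(text) - length + 1):
            cand = text[i:i + length]
            # Apriori pruning: a factor can only be frequent if both of
            # its sub-factors of length - 1 (prefix and suffix) are.
            if prev is None or (cand[:-1] in prev and cand[1:] in prev):
                counts[cand] += 1
        prev = {f: c for f, c in counts.items() if c >= min_support}
        if not prev:
            break  # no frequent factor of this length, so none longer
        frequent.update(prev)
    return frequent


if __name__ == "__main__":
    for factor, count in sorted(frequent_factors("abcabcab", 2, 3).items()):
        print(factor, count)
```

Each pass only counts candidates whose shorter sub-factors survived the previous pass, so the search stops as soon as one length yields no frequent factor.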