Efficient String Mining under Constraints Via the Deferred Frequency Index

Authors:
David Weese; Marcel H. Schulz
Affiliations:
Department of Computer Science, Free University of Berlin, Berlin, Germany 14195;Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany and, International Max Planck Research School for Computational Biolo ...
Venue:
ICDM '08 Proceedings of the 8th industrial conference on Advances in Data Mining: Medical Applications, E-Commerce, Marketing, and Theoretical Aspects
Year:
2008

Citing 17

A comparison of imperative and purely functional suffix tree constructions

ESOP '94 Selected papers of ESOP '94, the 5th European symposium on Programming
Efficient mining of emerging patterns: discovering trends and differences

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Suffix arrays: a new method for on-line string searches

SODA '90 Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms
Reducing the space requirement of suffix trees

Software—Practice & Experience
Data Mining Techniques: For Marketing, Sales, and Customer Support

Data Mining Techniques: For Marketing, Sales, and Customer Support
Making Use of the Most Expressive Jumping Emerging Patterns for Classification

PAKDD '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications
Mining Emerging Substrings

DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
A Theory of Inductive Query Answering

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Replacing suffix trees with enhanced suffix arrays

Journal of Discrete Algorithms - SPIRE 2002
Fast Frequent String Mining Using Suffix Arrays

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Looking for monotonicity properties of a similarity constraint on sequences

Proceedings of the 2006 ACM symposium on Applied computing
A new representation for protein secondary structure prediction based on frequent patterns

Bioinformatics
Mining minimal distinguishing subsequence patterns with gap constraints

Knowledge and Information Systems
Frequent pattern mining: current status and future directions

Data Mining and Knowledge Discovery
An efficient algorithm for mining string databases under constraints

KDID'04 Proceedings of the Third international conference on Knowledge Discovery in Inductive Databases
Optimal string mining under frequency constraints

PKDD'06 Proceedings of the 10th European conference on Principles and Practice of Knowledge Discovery in Databases
Theoretical and practical improvements on the RMQ-Problem, with applications to LCA and LCE

CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching

Abstract

We propose a general approach to frequency-based string mining, which has many applications, e.g., in contrast data mining. Our contribution is a novel algorithm based on a deferred data structure. Despite its simplicity, our approach is up to 4 times faster and uses about half the memory of the best-known algorithm, that of Fischer et al. Applications in various string domains, e.g., natural language, DNA, or protein sequences, demonstrate the improvement achieved by our algorithm.
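
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained string mining in a contrast setting. It is not the deferred frequency index described in the paper, and it is far slower than any suffix-based method. It assumes that the frequency of a pattern is the number of database strings that contain it; the function names and the parameters `min_pos` and `max_neg` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def substring_support(database):
    """Map each substring to the number of strings in `database` containing it.

    Assumption: frequency is counted per string, not per occurrence.
    """
    support = Counter()
    for text in database:
        # A set ensures each string contributes at most once per substring.
        seen = {
            text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
        }
        support.update(seen)
    return support


def mine_contrast_substrings(positive, negative, min_pos, max_neg):
    """Return substrings found in at least `min_pos` positive strings
    and in at most `max_neg` negative strings (hypothetical thresholds)."""
    pos_support = substring_support(positive)
    neg_support = substring_support(negative)
    return sorted(
        s
        for s, count in pos_support.items()
        # Counter returns 0 for substrings absent from the negative database.
        if count >= min_pos and neg_support[s] <= max_neg
    )


if __name__ == "__main__":
    print(mine_contrast_substrings(["abab", "abc"], ["bcd"], min_pos=2, max_neg=0))
```

This enumeration materializes every substring of every string, so its cost grows quadratically with string length; it only illustrates the kind of constraint a frequency-based miner evaluates.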