A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases

Authors:
Hiroki Arimura;Atsushi Wataki;Ryoichi Fujino;Setsuo Arikawa
Venue:
ALT '98 Proceedings of the 9th International Conference on Algorithmic Learning Theory
Year:
1998

Abstract

We consider a data mining problem for a large collection of unstructured texts, based on association rules over subwords of texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which expresses the rule that if a text contains the subword TATA followed by the subword AGGAGGT at a distance of no more than 30 letters, then property C holds with high probability. The optimized confidence pattern problem is to compute frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on developing more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m and n are the number and the total length of the classification examples, respectively, and k is a small constant, typically between 30 and 50. This algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry for fast computation. Furthermore, for most random texts such as DNA sequences, we show that a modification of the algorithm runs very efficiently, in time O(kn log^3 n) and space O(kn). We also discuss heuristics such as sampling and pruning as practical improvements. We then evaluate the efficiency and performance of the algorithm through experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
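
To make the pattern semantics concrete, here is a minimal Python sketch of the straightforward enumeration baseline mentioned in the abstract. It is not the suffix tree and range query algorithm of the paper. The function names, the toy examples, and the exact definitions are assumptions made for illustration: distance is taken as the number of letters between the end of α and the start of β, support as the fraction of texts a pattern matches, and confidence as the fraction of matched texts that are labelled positive.

```python
from typing import List, Optional, Tuple

Pattern = Tuple[str, int, str]   # (alpha, k, beta)
Example = Tuple[str, bool]       # (text, has property C)


def occurrences(text: str, word: str) -> List[int]:
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions = []
    i = text.find(word)
    while i != -1:
        positions.append(i)
        i = text.find(word, i + 1)
    return positions


def matches(text: str, alpha: str, k: int, beta: str) -> bool:
    """True if alpha is followed by beta with a gap of at most k letters.

    Assumption: the gap is counted from the end of alpha to the start of beta.
    """
    beta_starts = occurrences(text, beta)
    for i in occurrences(text, alpha):
        end = i + len(alpha)
        if any(0 <= j - end <= k for j in beta_starts):
            return True
    return False


def support_and_confidence(pattern: Pattern,
                           examples: List[Example]) -> Tuple[float, float]:
    """Support and confidence of a pattern over labelled texts (assumed definitions)."""
    alpha, k, beta = pattern
    matched = [label for text, label in examples if matches(text, alpha, k, beta)]
    support = len(matched) / len(examples)
    confidence = sum(matched) / len(matched) if matched else 0.0
    return support, confidence


def brute_force_best(examples: List[Example], k: int,
                     min_support: float) -> Optional[Tuple[float, Pattern]]:
    """Try every pair of subwords; keep a frequent pattern of highest confidence.

    Only feasible for tiny inputs: this is the naive enumeration, not the
    efficient algorithm described in the abstract.
    """
    words = sorted({t[i:j] for t, _ in examples
                    for i in range(len(t)) for j in range(i + 1, len(t) + 1)})
    best = None
    for alpha in words:
        for beta in words:
            s, c = support_and_confidence((alpha, k, beta), examples)
            if s >= min_support and (best is None or c > best[0]):
                best = (c, (alpha, k, beta))
    return best


# Hypothetical toy data: two positive and two negative texts.
examples = [
    ("GGTATACCAGGT", True),
    ("TATAAGGT", True),
    ("AGGTCCTATA", False),
    ("CCCCCC", False),
]

print(support_and_confidence(("TATA", 3, "AGGT"), examples))  # (0.5, 1.0)
print(brute_force_best(examples, k=3, min_support=0.5))
```

On this toy data the pattern (TATA, 3, AGGT) matches only the two positive texts. The third text contains both subwords but in the wrong order, which is why the ordered, distance-bounded form of the pattern matters.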