We consider a data mining problem over a large collection of unstructured texts, based on association rules over subwords of the texts. A two-words association pattern is an expression such as (TATA, 30, AGGAGGT) ⇒ C, which states the rule that if a text contains the subword TATA followed by the subword AGGAGGT at a distance of no more than 30 letters, then property C holds with high probability. The optimized confidence pattern problem is to compute the frequent patterns (α, k, β) that optimize the confidence with respect to a given collection of texts. Although this problem can be solved in polynomial time by a straightforward algorithm that enumerates all possible patterns in time O(n^5), we focus on developing more efficient algorithms that can be applied to large text databases. We present an algorithm that solves the optimized confidence pattern problem in time O(max{k, m} n^2) and space O(kn), where m is the number of classification examples, n is their total length, and k is a small constant of around 30 to 50. For fast computation, this algorithm combines the suffix tree data structure from combinatorial string matching with the orthogonal range query technique from computational geometry. Furthermore, for most random texts, such as DNA sequences, we show that a modification of the algorithm runs very efficiently, in time O(kn log^3 n) and space O(kn). We also discuss heuristics such as sampling and pruning as practical improvements. We then evaluate the efficiency and performance of the algorithm with experiments on genetic sequences. A relationship with efficient agnostic PAC-learning is also discussed.
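To make the pattern semantics concrete, the following minimal Python sketch tests whether a text matches a pattern (α, k, β) and computes the support and confidence of the rule (α, k, β) ⇒ C over a set of labeled texts. It is a naive illustration only, not the suffix-tree and range-query algorithm described above. It rests on several assumptions that the abstract does not state: distance is taken as the number of letters between the end of α and the start of β, support is the fraction of all texts that match, and confidence is the fraction of matching texts that have property C. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def occurrences(text: str, word: str) -> List[int]:
    """Return every start index of word in text (overlapping hits included)."""
    starts = []
    i = text.find(word)
    while i != -1:
        starts.append(i)
        i = text.find(word, i + 1)
    return starts


def matches(text: str, alpha: str, k: int, beta: str) -> bool:
    """True if beta starts at most k letters after some occurrence of alpha ends.

    Assumption: "distance" is the gap between the end of alpha and the
    start of beta; the source may define it differently.
    """
    beta_starts = occurrences(text, beta)
    for a in occurrences(text, alpha):
        end = a + len(alpha)
        if any(0 <= b - end <= k for b in beta_starts):
            return True
    return False


def support_and_confidence(
    examples: Sequence[Tuple[str, bool]], alpha: str, k: int, beta: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule (alpha, k, beta) => C.

    Each example is (text, has_property_C). Both measures are assumed
    definitions in the usual association-rule sense.
    """
    matched = [label for text, label in examples if matches(text, alpha, k, beta)]
    if not matched:
        return 0.0, 0.0
    support = len(matched) / len(examples)
    confidence = sum(matched) / len(matched)
    return support, confidence


# Toy data, invented for illustration only.
examples = [
    ("GGTATACCAGGAGGTAA", True),             # gap of 2 letters: matches
    ("TATA" + "C" * 40 + "AGGAGGT", True),   # gap of 40 letters: too far
    ("TATAAGGAGGT", False),                  # gap of 0 letters: matches
    ("CCCC", False),                         # neither subword present
]
print(support_and_confidence(examples, "TATA", 30, "AGGAGGT"))  # (0.5, 0.5)
```

A straightforward solver would evaluate a function like this for every candidate pair of subwords, which is why the enumeration cost grows so quickly with the total text length n.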