Best fitting fixed-length substring patterns for a set of strings

Authors:
Hirotaka Ono;Yen Kaow Ng
Affiliations:
Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka, Japan;Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Japan
Venue:
COCOON'05 Proceedings of the 11th annual international conference on Computing and Combinatorics
Year:
2005

Abstract

Finding a pattern, or a set of patterns, that best characterizes a set of strings is considered important in Knowledge Discovery as applied to Molecular Biology. Our main objective is to address the problem of “over-generalization”, the phenomenon in which a characterization is so general that it potentially includes many incorrect examples. To overcome this, we formally define a criterion for a most fitting language for a set of strings, via a natural notion of density. We show how the problem can be solved by solving the membership problem and the counting problem, and we study the runtime complexity of the problem with respect to three solution spaces derived from unions of the languages generated from fixed-length substring patterns. For two of these, we show that the problem is solvable in time polynomial in the input size. In the third case, however, the problem turns out to be NP-complete.
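
The abstract does not spell out the density measure, so the following is only an illustrative sketch of how membership and counting might combine, not the authors' definition or algorithm. It assumes that a substring pattern generates the language of all strings containing it, that all sample strings share one length n, and that density is the number of sample strings covered divided by the number of length-n strings in the language. The names `contains`, `count_containing`, and `density` are hypothetical.

```python
def contains(pattern, s):
    """Membership: is s in the language of strings containing pattern?"""
    return pattern in s


def count_containing(pattern, n, alphabet):
    """Counting: number of length-n strings over alphabet containing pattern.

    Counts the strings that avoid the (non-empty) pattern with a dynamic
    program over the KMP matching automaton, then subtracts from the total.
    """
    symbols = sorted(set(alphabet))
    m = len(pattern)
    # KMP failure function: fail[i] is the length of the longest proper
    # border of pattern[:i + 1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def step(state, ch):
        # Advance the automaton from a non-accepting state on symbol ch.
        while state and pattern[state] != ch:
            state = fail[state - 1]
        if pattern[state] == ch:
            state += 1
        return state

    # counts[q] = number of strings read so far that avoid the pattern
    # and leave the automaton in state q.
    counts = [0] * m
    counts[0] = 1
    for _ in range(n):
        nxt = [0] * m
        for state, c in enumerate(counts):
            if c:
                for ch in symbols:
                    t = step(state, ch)
                    if t < m:  # state m would mean the pattern occurred
                        nxt[t] += c
        counts = nxt
    return len(symbols) ** n - sum(counts)


def density(pattern, sample, alphabet):
    """Assumed density: covered sample strings / language size at length n."""
    n = len(sample[0])
    covered = sum(1 for s in sample if contains(pattern, s))
    total = count_containing(pattern, n, alphabet)
    return covered / total if total else 0.0
```

Under these assumptions, a pattern whose language is small yet still covers the sample scores higher than an overly general one, which is the intuition behind penalizing over-generalization.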