Measuring over-generalization in the minimal multiple generalizations of biosequences

Authors: Yen Kaow Ng; Hirotaka Ono; Takeshi Shinohara
Affiliations: Graduate School of Computer Science and Systems, Kyushu Institute of Technology, Iizuka, Japan; Department of Computer Science and Communication Engineering, Kyushu University, Fukuoka, Japan; Department of Artificial Intelligence, Kyushu Institute of Technology, Iizuka, Japan
Venue: DS'05 Proceedings of the 8th international conference on Discovery Science
Year: 2005

References (8):
Counting and random generation of strings in regular languages. Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms.
Introduction To Automata Theory, Languages, And Computation.
Polynomial Time Inference of Extended Regular Pattern Languages. Proceedings of RIMS Symposium on Software Science and Engineering.
Finding Minimal Generalizations for Unions of Pattern Languages and Its Application to Inductive Inference from Positive Data. STACS '94 Proceedings of the 11th Annual Symposium on Theoretical Aspects of Computer Science.
Compactness and Learning of Classes of Unions of Erasing Regular Pattern Languages. ALT '02 Proceedings of the 13th International Conference on Algorithmic Learning Theory.
RE-tree: an efficient index structure for regular expressions. The VLDB Journal — The International Journal on Very Large Data Bases.
Inferring unions of the pattern languages by the most fitting covers. ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory.
Best fitting fixed-length substring patterns for a set of strings. COCOON'05 Proceedings of the 11th annual international conference on Computing and Combinatorics.

Cited by (4):
Developments from enquiries into the learnability of the pattern languages from positive data. Theoretical Computer Science.
Finding consensus patterns in very scarce biosequence samples from their minimal multiple generalizations. PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining.
Inferring unions of the pattern languages by the most fitting covers. ALT'05 Proceedings of the 16th international conference on Algorithmic Learning Theory.
Characteristic sets for inferring the unions of the tree pattern languages by the most fitting hypotheses. ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications.

Abstract

We consider the problem of finding a set of patterns that best characterizes a set of strings. To this end, Arimura et al. [3] considered the use of minimal multiple generalizations (mmgs) for such characterizations. Given a sample set and a class of languages, the mmgs are, roughly speaking, the most (syntactically) specific sets of languages within that class that contain the sample. Takae et al. [17] found the mmgs of the class of pattern languages [1] that includes so-called sort symbols to be fairly accurate predictors of signal peptides. We first reproduce their results using updated data. Then, using a measure that estimates the level of over-generalization made by the mmgs, we present results that explain the high accuracy obtained with sort symbols, and discuss how better results can be obtained. The measure we suggest here can also be applied to other types of patterns, e.g. the PROSITE patterns [4].