Characteristic Sets of Strings Common to Semi-structured Documents

Authors:
Daisuke Ikeda
Venue:
DS '99 Proceedings of the Second International Conference on Discovery Science
Year:
1999


Abstract

We consider the Maximum Agreement Problem: given positive and negative documents, find a characteristic set that matches many of the positive documents but rejects many of the negative ones. A characteristic set is a sequence (x_1, ..., x_d) of strings such that each x_i is a suffix of x_{i+1} and all the x_i appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user-defined macros. For example, ("set", "characteristic set", "</title> characteristic set") is a characteristic set extracted from an HTML file. However, an algorithm that solves the Maximum Agreement Problem does not output useless characteristic sets, such as those made only of HTML tags, since such characteristic sets may match most of the positive documents but also match most of the negative ones. We present an algorithm that, given an integer d, the number of strings in a characteristic set, solves the Maximum Agreement Problem in O(n^2 h^d) time, where n is the total length of the documents and h is the height of the suffix tree of the documents.
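The definitions in the abstract can be illustrated with a small brute-force sketch. It is not the paper's suffix-tree algorithm and makes no attempt at the stated time bound. The function names (is_suffix_chain, occurrences, matches, agreement) are hypothetical. The sketch also assumes that "appear without overlaps" means each string can be assigned its own occurrence in the document so that no two occurrences share a position, and that agreement is the number of positive documents matched plus the number of negative documents rejected; the abstract does not spell out either detail.

```python
def is_suffix_chain(xs):
    """True if each string is a suffix of the next one in the sequence."""
    return all(longer.endswith(shorter) for shorter, longer in zip(xs, xs[1:]))


def occurrences(s, doc):
    """All (start, end) spans where s occurs in doc, overlapping ones included."""
    spans = []
    start = doc.find(s)
    while start != -1:
        spans.append((start, start + len(s)))
        start = doc.find(s, start + 1)
    return spans


def matches(xs, doc):
    """Assumed semantics: every string gets its own occurrence in doc,
    and no two chosen occurrences overlap."""
    if not xs or any(not x for x in xs) or not is_suffix_chain(xs):
        return False
    # Place the longest strings first; they usually have the fewest occurrences.
    order = sorted(xs, key=len, reverse=True)

    def place(i, used):
        if i == len(order):
            return True
        for start, end in occurrences(order[i], doc):
            if all(end <= u_start or start >= u_end for u_start, u_end in used):
                if place(i + 1, used + [(start, end)]):
                    return True
        return False

    return place(0, [])


def agreement(xs, positives, negatives):
    """Assumed score: positives matched plus negatives rejected."""
    matched = sum(matches(xs, d) for d in positives)
    rejected = sum(not matches(xs, d) for d in negatives)
    return matched + rejected


# Example built around the characteristic set quoted in the abstract.
xs = ("set", "characteristic set", "</title> characteristic set")
positive_doc = "<title>x</title> characteristic set; a characteristic set is a set"
negative_doc = "<title>x</title> characteristic set"
score = agreement(xs, [positive_doc], [negative_doc])
```

Under these assumptions the second document is rejected because its only occurrence of "characteristic set" lies inside the occurrence of the longest string, so the two cannot be placed without overlapping.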