Mining Semi-structured Data by Path Expressions

Authors:
Katsuaki Taniguchi; Hiroshi Sakamoto; Hiroki Arimura; Shinichi Shimozono; Setsuo Arikawa
Venue:
DS '01 Proceedings of the 4th International Conference on Discovery Science
Year:
2001

Citing 12

A theory of the learnable. Communications of the ACM
Combinatorial pattern discovery for scientific data: some preliminary results. SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Information extraction from HTML: application of a general machine learning approach. AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Learning page-independent heuristics for extracting data from Web pages. WWW '99 Proceedings of the eighth international conference on World Wide Web
Data on the Web: from relations to semistructured data and XML
Wrapper induction: efficiency and expressiveness. Artificial Intelligence - Special issue on Intelligent internet systems
Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence - Special issue on Intelligent internet systems
Queries and Concept Learning. Machine Learning
Queries and Concept Learning. Machine Learning
Identification of Tree Translation Rules from Examples. ICGI '00 Proceedings of the 5th International Colloquium on Grammatical Inference: Algorithms and Applications
A Unifying Approach to HTML Wrapper Representation and Learning. DS '00 Proceedings of the Third International Conference on Discovery Science
Polynomial time approximation schemes for Euclidean TSP and other geometric problems. FOCS '96 Proceedings of the 37th Annual Symposium on Foundations of Computer Science

Cited 2

Knowledge Discovery from Semistructured Texts. Progress in Discovery Science, Final Report of the Japanese Discovery Science Project
Efficient Text Mining with Optimized Pattern Discovery. CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching

Abstract

A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labelled by a labelling function, each page is decomposed into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail. The label of an element node is called a tag, and the label of a text node is called a text. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair (t, w) of a tag-sequence t and a word-sequence w represented by a word-association pattern [1]. An association path (t, w) matches a path p if t is a subsequence of the tag-sequence of p and w matches the text of p; it agrees with the labelling function on p if it matches p exactly when p comes from a positive example. The importance of an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of this model by experiments.
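
The agreement measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' mining algorithm: it only scores one given association path against labelled paths. The names `is_subsequence`, `matches`, and `agreement` are hypothetical, and the word-association pattern of [1] is simplified here to an ordered subsequence of whitespace-separated words, which is an assumption and may omit constraints of the original definition.

```python
from typing import List, Sequence, Tuple

# A path is modelled as (tag-sequence, text); this representation is an assumption.
Path = Tuple[Sequence[str], str]
# An association path is modelled as (tag-sequence t, word-sequence w).
AssociationPath = Tuple[Sequence[str], Sequence[str]]


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if pattern occurs in sequence in order, not necessarily contiguously."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def matches(assoc: AssociationPath, path: Path) -> bool:
    """t must be a subsequence of the path's tags and w must match the path's text.

    Simplification: w matches if its words occur in order in the text.
    """
    tags, words = assoc
    path_tags, text = path
    return is_subsequence(tags, path_tags) and is_subsequence(words, text.split())


def agreement(assoc: AssociationPath, labelled: List[Tuple[Path, bool]]) -> int:
    """Count the paths on which the pattern matches iff the path is positive."""
    return sum(1 for path, positive in labelled if matches(assoc, path) == positive)


# Toy data (invented for illustration): each path is paired with its label.
paths = [
    ((["html", "body", "h1"], "data mining tools"), True),
    ((["html", "body", "p"], "data mining news"), False),
    ((["html", "body", "h1"], "contact us"), False),
]

specific = (["body", "h1"], ["data", "mining"])
general = (["body"], ["data"])

print(agreement(specific, paths))  # 3: agrees on every path
print(agreement(general, paths))   # 2: also matches the negative "p" path
```

On this toy data the more specific pattern scores higher because the general one also matches a negative path; a miner in this setting would search for the pattern that maximizes this count.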