Mining Semi-structured Data by Path Expressions

  • Authors:
  • Katsuaki Taniguchi, Hiroshi Sakamoto, Hiroki Arimura, Shinichi Shimozono, Setsuo Arikawa

  • Venue:
  • DS '01 Proceedings of the 4th International Conference on Discovery Science
  • Year:
  • 2001

Abstract

A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labeled by a labelling function, each HTML page is decomposed into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail of the path. The labels of an element node and a text node are called a tag and a text, respectively. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair of a tag-sequence t and a word-sequence w represented by the word-association pattern [1]. An association path (t,w) agrees with a labelling function on a path p if the following holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p is in a positive example. The importance of such an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of this model by experiments.
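
The agreement measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (is_subsequence, matches, agreement) and the sample data are hypothetical, and the word-association pattern of [1] is replaced here by a simplifying assumption, namely that w matches a text when its words occur in the text in the same order, with arbitrary gaps.

```python
from typing import Iterable, Sequence, Tuple

# A path is modelled as (tag sequence, text found at the tail of the path).
Path = Tuple[Tuple[str, ...], str]


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """Return True if pattern occurs in sequence in order, gaps allowed."""
    it = iter(sequence)
    # "item in it" advances the iterator, so the order of items is enforced.
    return all(item in it for item in pattern)


def matches(t: Sequence[str], w: Sequence[str], path: Path) -> bool:
    """ASSUMPTION: w matches the text if its words appear in order.

    The word-association pattern of [1] may impose further constraints
    that this simplified check does not model.
    """
    tags, text = path
    return is_subsequence(t, tags) and is_subsequence(w, text.lower().split())


def agreement(
    t: Sequence[str],
    w: Sequence[str],
    labelled_paths: Iterable[Tuple[Path, bool]],
) -> int:
    """Count the paths on which (t, w) agrees with the labelling.

    (t, w) agrees on a path when it matches the path exactly if the
    path comes from a positive example.
    """
    return sum(
        1 for path, positive in labelled_paths
        if matches(t, w, path) == positive
    )


# Hypothetical labelled paths: True marks a path from a positive page.
examples = [
    ((("html", "body", "table", "tr", "td"), "price of the new model"), True),
    ((("html", "body", "p"), "price list archive"), False),
    ((("html", "body", "table", "tr", "td"), "contact us"), False),
]

score = agreement(("table", "td"), ("price", "model"), examples)
```

Under these assumptions, a mining procedure would search the space of candidate pairs (t, w) for those with the highest agreement; how that search is organized efficiently is the subject of the paper and is not modelled in this sketch.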