Extracting Characteristic Structures among Words in Semistructured Documents

  • Authors:
  • Kazuyoshi Furukawa;Tomoyuki Uchida;Kazuya Yamada;Tetsuhiro Miyahara;Takayoshi Shoudai;Yasuaki Nakamura

  • Venue:
  • PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
  • Year:
  • 2002

Abstract

The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k). A tree-association pattern on (W_1, W_2, ..., W_k) is a sequence ⟨t_1; t_2; ...; t_{k-1}⟩ of labeled rooted trees such that, for i = 1, 2, ..., k - 1, (1) t_i consists of only one node having the pair of words W_i and W_{i+1} as its label, or (2) t_i is a labeled rooted tree which has exactly two leaves, labeled with W_i and W_{i+1}, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that the algorithm is effective for extracting structural characteristics in semistructured documents.
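
To make the definition of a tree-association pattern concrete, the following Python sketch checks whether a sequence of trees satisfies conditions (1) and (2) for a given word list. It is only an illustration of the definition, not the mining algorithm of the paper. The names `Node`, `leaves`, `is_valid_component`, and `is_tree_association_pattern` are hypothetical, as are the internal node labels (`p`, `b`, `i`) in the example. The sketch also assumes that the order of the two leaves does not matter, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A node label is either a single string or a pair of words (case 1).
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted tree (hypothetical representation)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    """Return all leaves of the tree rooted at `node`."""
    if not node.children:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def is_valid_component(tree: Node, w_i: str, w_next: str) -> bool:
    """Check conditions (1) and (2) of the definition for one tree t_i."""
    # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
    if not tree.children:
        return tree.label == (w_i, w_next)
    # Case (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Assumption: leaf order is not significant.
    leaf_labels = [leaf.label for leaf in leaves(tree)]
    return leaf_labels in ([w_i, w_next], [w_next, w_i])


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that `trees` is a tree-association pattern on `words`."""
    k = len(words)
    if k < 2 or list(words) != sorted(words):
        return False  # need k >= 2 and lexicographically sorted words
    if len(trees) != k - 1:
        return False  # the pattern has exactly k - 1 trees
    return all(
        is_valid_component(trees[i], words[i], words[i + 1])
        for i in range(k - 1)
    )


# Example: k = 3 words, so the pattern consists of two trees.
words = ("data", "mining", "text")
t1 = Node(("data", "mining"))  # case (1): both words in one node
t2 = Node("p", [               # case (2): two leaves under a common root
    Node("b", [Node("mining")]),
    Node("i", [Node("text")]),
])
print(is_tree_association_pattern(words, [t1, t2]))
```

In this example, `t1` represents two words occurring in the same text node, while `t2` represents two words located in different parts of a document tree that share a common ancestor.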